A common bottleneck in evaluating extremal performance measures is that, due to their very nature, tail data
are often very limited. The conventional approach selects the best probability distribution from tail data using 
parametric fitting, but the validity of the parametric choice can be difficult to verify. This paper describes 
an alternative based on the computation of worst-case bounds under the geometric premise of tail convexity, 
a feature shared by all common parametric tail distributions. We characterize the optimality structure of 
the resulting optimization problem, and demonstrate that the worst-case convex tail behavior is in a sense 
either extremely light-tailed or extremely heavy-tailed. We develop low-dimensional nonlinear programs that 
distinguish between the two cases and compute the worst-case bound. We numerically illustrate how the 
proposed approach can give more reliable performances than conventional parametric methods. 
1. Introduction


Modeling extreme behaviors is a fundamental task in analyzing and managing risk. As the earliest
applications, hydrologists and climatologists study historical data of sea levels and air pollutants
to estimate the risk of flooding and pollution (Gumbel (2012)). In non-life or casualty insurance,
insurers rely on accurate prediction of large losses to price and manage insurance policies (McNeil
(1997), Beirlant and Teugels (1992), Embrechts et al. ([...] estimate risk measures of portfolios to
safeguard losses (Glasserman and Li (2005), Glasserman et al. (2007, 2008)). In engineering,
measurement of system reliability often involves modeling the tail behaviors of individual
components' failure times (Nicola et al. (1993), Heidelberger (1995)).
Despite its importance in various disciplines, tail modeling is an intrinsically difficult task
because, by their own nature, tail data are often very limited. Consider these two examples: 

Example 1 (Adopted from McNeil (1997)). There were 2,156 Danish fire losses over one
million Danish Krone (DKK) from 1980 to 1990. The empirical cumulative distribution function
(ECDF) and the histogram (in log scale) are plotted in Figure 1. For a concrete use of the data,
an insurance company might be interested in pricing a high-excess contract with reinsurance, which
has a payoff of X - 50 (in million DKK) when 50 < X < 200, 150 when X > 200, and 0 when
X < 50, where X is the loss amount (the marks 50 and 200 are labeled with vertical lines in Figure
1). Pricing this contract would require, among other information, E[payoff]. However, only seven
data points are above 50 (the loss amount above which the payoff is non-zero).
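The payoff in Example 1 can be written as a short function. The sketch below is only illustrative: the function names and the sample losses are hypothetical (they are not the Danish fire data), and the sample average shown is the naive estimator whose weakness motivates the paper, not the proposed method.

```python
def payoff(x: float) -> float:
    """Payoff (million DKK) of the high-excess contract for a loss x (million DKK)."""
    # Zero below 50, x - 50 between 50 and 200, capped at 150 above 200.
    return min(max(x - 50.0, 0.0), 150.0)


def empirical_expected_payoff(losses) -> float:
    """Naive sample average of the payoff; only losses above 50 contribute."""
    losses = list(losses)
    return sum(payoff(x) for x in losses) / len(losses)


# Hypothetical losses for illustration only (NOT the Danish fire data).
example_losses = [1.2, 3.4, 60.0, 250.0]
example_estimate = empirical_expected_payoff(example_losses)
```

With only a handful of observations above 50, such an average is driven almost entirely by those few points.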




Figure 1: ECDF and histogram for Danish fire losses from 1980 to 1990 

Example 2. A more extreme situation is a synthetic data set of size 200 generated from an
unknown distribution, whose histogram is shown in Figure 2. Suppose the quantity of interest is
P(4 < X < 5). This appears to be an ill-posed problem since the interval [4,5] has no data at all.
This situation is not uncommon when in any application one tries to extrapolate the tail with a 
small sample size. 

The purpose of this paper is to develop a theoretically justified methodology to estimate tail- 
related quantities of interest such as those depicted in the examples above. This requires drawing 
information properly from data not in the tail. We will illustrate how to do this and revisit the 
two examples later with numerical performance of our method. 
Figure 2: Histogram of a synthetic data set with sample size 200

2. Our Approach and Main Contributions 

We adopt a nonparametric approach. Rather than fitting a tail parametric curve when there can 
be few or zero observations in the tail region, we base our analysis on the geometric premise that 
the tail density is convex. We emphasize that this condition is satisfied by all known parametric 
distributions (e.g. normal, lognormal, exponential, gamma, Weibull, Pareto etc.). For this reason 
we believe it is a natural and minimal assumption to make. 

In any given problem, there can be potentially infinitely many feasible candidates of convex tails. 
The central idea of our method is a worst-case characterization. Formally, given information on the 
non-tail part of the distribution and a target quantity of interest (e.g., P(4 < X < 5) in Example
2), we aim to find a convex tail, consistent with the non-tail part, that gives rise to the worst-case
value of the target (e.g., the largest possible value of P(4 < X < 5)). This value serves as a tight
bound for the target that is robust with respect to the ambiguity of the tail, without using any 
particular tail knowledge other than our a priori assumption of convexity. 

Our proposed approach requires solving an optimization over a potentially infinite-dimensional 
space of convex tails. As our key contributions, we show that this problem has a very simple 
optimality structure, and find its solution via low-dimensional nonlinear programs. In particular: 
1. We characterize the worst-case tail behavior under the tail convexity condition. We show
that the worst-case tail, for any bounded target quantity of interest, is in a sense either extremely 
light-tailed or extremely heavy-tailed. Both cases can be characterized by piecewise linear densities, 
the distinction being whether the pieces form a bounded support distribution or lead to probability 
masses that escape to infinity. 

2. We provide efficient algorithms to distinguish between the two cases above, and to solve for 
the optimal distribution in each case. For a large class of objectives, the algorithm requires at most 
a two-dimensional nonlinear program. 

Our approach outputs statistically valid worst-case bounds when integrated with confidence
estimates drawn from the non-tail portion of the data. This approach uses the convexity assumption
to get around the difficulty faced by conventional parametric methods (discussed in detail in the
next section) in directly estimating the tail curve, by effectively shifting the estimation burden
to the central part of the density curve where more data are available. However, we pay the price of
conservativeness: our method can generate a worst-case bound that is over-pessimistic. We therefore
believe it is most suitable for small sample sizes, when a price of conservativeness is unavoidable in
exchange for statistical validity.

The remainder of this paper is organized as follows. Section 3 discusses some previous techniques
and reviews the relevant literature. Section 4 presents our formulation and results for an abstract
setting. Section 5 studies the numerical solution algorithm. Section 6 focuses on integrating these
results with data. Section 7 shows some numerical illustration. Section 8 concludes and discusses
future work. Some auxiliary theorems and proofs are left to the Appendix.

3. Related Work 


3.1. Overview of Common Tail-fitting Techniques 

As far as we know, all existing techniques for modeling extreme events are parametric-based, in
the sense that a "best" parametric curve is chosen and the parameters are fit to the tail data.
The classic text of Hogg and Klugman (2009) provides a comprehensive discussion on the common
choices of parametric tail densities. While exploratory data analysis, such as quantile plots and
mean excess plots, can provide guidance regarding the class of parametric curves to use (such as
heavy, middle or light tail), this approach is limited by its reliance on a large amount of data in
the tail and subjectivity in the choice of parametric curve.

Beyond the goodness-of-fit approach, there are two widely used results on parametric choices
that are provably suitable for extreme values. The Fisher-Tippett-Gnedenko Theorem (Fisher and
Tippett (1928), Gnedenko (1943)) states that the sample maxima, after suitable scaling, must
converge to a generalized extreme value (GEV) distribution, given that they converge at all to some
non-degenerate distribution. This result is useful if the data are known to derive from the maximum
of some distributions. For instance, environmental data on sea level and river heights are often
collected as annual maxima (Davison and Smith (1990)), and in this scenario it is sensible to fit the
GEV distribution. In other scenarios, the data have to be pre-divided into blocks and blockwise
maxima have to be taken in order to apply GEV, but this blockwise approach is statistically
wasteful (Embrechts et al. (2005)).


The Pickands-Balkema-de Haan Theorem (Pickands III (1975), Balkema and De Haan (1974))
does not require data to come from maxima. Rather, the theorem states that the excess losses over
thresholds converge to a generalized Pareto distribution (GPD) as the thresholds approach infinity,
under the same conditions as the Fisher-Tippett-Gnedenko Theorem. The Pickands-Balkema-de
Haan theorem provides a solid mathematical justification for using GPD to fit the tail portion of
data (McNeil (1997), Embrechts et al. (2005)). Fitting GPD can be done by well-studied procedures
such as maximum likelihood estimation (Smith (1985)), and the method of probability-weighted
moments (Hosking and Wallis (1987)). The Hill estimator (Hill et al. (1975), Davis and Resnick
(1984)) is also a widely used alternative.

Despite its attraction and frequent usage, fitting GPD suffers from two pitfalls: First, there is
no convergence rate result that tells how high a threshold should be for the GPD approximation to
be valid (e.g. McNeil (1997)). Hence, picking the threshold is an ad hoc task in practice. Second,


Lam and Mottet: Worst-case Tail Analysis 
Article submitted to Operations Research; manuscript no. (Please, provide the manuscript number!) 


and more importantly, even if the threshold chosen is sufficiently high for the approximation to
hold, a large amount of data above it is needed to accurately estimate the parameters in GPD. In
our two examples, especially Example 2, this is plainly impossible.


3.2. Related Literature on our Methodology 

Our mathematical formulation and techniques are related to two lines of literature. The use of
convexity and other shape constraints (such as log-concavity) has appeared in density estimation
(Cule et al. (2010), Seregin and Wellner (2010), Koenker and Mizera (2010)) and convex regression
(Seijo et al. (2011), Hannah and Dunson (2013), Lim and Glynn (2012)) in statistics. A major
reason for using convexity in these statistical problems is the removal of tuning parameters, such
as the bandwidth required by kernel-based methods.

The second line of related literature is optimization over probability distributions, which has
appeared in decision analysis (Smith (1995), Bertsimas and Popescu (2005), Popescu (2005)),
robust control theory (Iyengar (2005), El Ghaoui and Nilim (2005), Petersen et al., Hansen
and Sargent (2008)), distributionally robust optimization (Delage and Ye (2010), Goh and Sim
(2010)), and stochastic programming (Birge and Wets (1987), Birge and Dula (1991)). The typical
formulation involves optimization of some objective governed by a probability distribution that
is partially specified via constraints like moments (Karr (1983), Winkler (1988)) and statistical
distances (Ben-Tal et al. (2013)). Our formulation differs from these studies by its pertinence to
tail modeling (i.e., knowledge of certain regions of the density, but none beyond it). Among all the
previous works, only Popescu (2005) has considered a convex density assumption, as an instance of a
proposed class of geometric conditions that are added to moment constraints. While the result bears
similarity to ours in that a piecewise linearity structure shows up in the solution, our qualitative
classification of the tail, the solution techniques, and the formulation in integrating with data all
differ from the semidefinite programming approach in Popescu (2005).

4. Abstract Formulation and Results 


We begin by considering an abstract formulation assuming full information on the distribution up 
to some threshold, and no information beyond. The next sub-sections give the details. 
4.1. Formulation

Consider a continuous probability distribution on $\mathbb{R}$ whose density exists and is denoted by $f(x)$.
We assume that $f$ is known up to a certain large threshold, say $a \in \mathbb{R}$. The goal is to extrapolate
$f$.

We impose the assumption that $f(x)$, for $x \ge a$, is convex. Figure 3 shows an example of an $f(x)$
known up to $a$, and Figures 4 and 5 show an example of convex and non-convex extrapolation,
respectively. Observe that the convex tail assumption excludes any "surprising" bumps (and falls)
in the density curve.

Figure 3: A probability density $f(x)$ known up to a threshold $a$

Figure 4: An example of convex tail extrapolation

Figure 5: An example of non-convex tail extrapolation

Figure 6: The parameters $\beta$, $\eta$ and $\nu$


Now suppose we are given a target objective or performance measure $E[h(X)]$, where $E[\cdot]$ denotes
the expectation under $f$, and $h: \mathbb{R} \to \mathbb{R}$ is a bounded function in $X$. The goal is to calculate the
worst-case value of $E[h(X)]$ under the assumption that $f$ is convex beyond $a$. That is, we want
to obtain $\max E[h(X)] = \max \int_{-\infty}^{\infty} h(x) f(x)\,dx$, where the maximization is over all convex $f(x)$, $x \ge a$,
such that $f$ satisfies the properties of a probability density function. We assume that the density
is left-differentiable at $a$, so that a convex extrapolation at $a$ can be suitably defined. For the
formulation, we need three constants extracted from $f(x)$, $x \le a$, which we denote as $\eta, \nu, \beta > 0$
respectively:

1. $\eta$ is the value of the density $f$ at $a$, i.e. $f(a) = \eta$.

2. $-\nu$ is the left derivative of $f$ at $a$, i.e. $f'_-(a) = -\nu$. We impose the condition that the right
derivative $f'_+(a) \ge f'_-(a) = -\nu$. Note that, since $f$ is convex (and bounded) on $[a, \infty)$, its one-sided
derivatives exist everywhere on $[a, \infty)$.

3. $\beta$ is the tail probability at $a$. Since $f$ is known up to $a$, $\int_{-\infty}^{a} f(x)\,dx$ is known to be equal to
some number $1 - \beta$, and $\int_a^{\infty} f(x)\,dx$ must equal $\beta$.

Figure 6 illustrates these quantities. For $\eta, \nu, \beta > 0$, our formulation can be written as


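$$
\begin{aligned}
\max_{f} \quad & \int_a^{\infty} h(x) f(x)\,dx \\
\text{subject to} \quad & \int_a^{\infty} f(x)\,dx = \beta && \text{(1a)} \\
& f(a) = f(a+) = \eta && \text{(1b)} \\
& f'_+(a) \ge -\nu && \text{(1c)} \\
& f \text{ convex for } x \ge a && \text{(1d)} \\
& f(x) \ge 0 \text{ for } x \ge a && \text{(1e)}
\end{aligned}
$$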


Note that we have set our objective to be $E[h(X); X \ge a]$, since $E[h(X); X < a]$ is completely
known in this setting. Here $f(a+)$ denotes the right-limit at $a$, and $f(a) = f(a+)$ means that $f$ is
right-continuous at $a$, implying a continuous extrapolation at $a$.

4.2. Optimality Characterization 

The solution structure of (1) turns out to be extremely simple and is characterized by either
one of two closely related cases (focusing on the region $x \ge a$). Let $C^+[a,\infty)$ denote the class of
non-negative continuous functions on $[a, \infty)$. Let

$$\mathcal{PL}^m_+[a,\infty) = \{ f \in C^+[a,\infty) : f(x) = c_j + d_j x \text{ for } x \in [y_{j-1}, y_j],\ j = 1,\ldots,m,$$
$$\text{where } a = y_0 \le y_1 \le \cdots \le y_m < \infty,\ c_j, d_j \in \mathbb{R}, \text{ and } f(x) = 0 \text{ for } x \ge y_m \}$$

be the set of all non-negative, continuous and piecewise linear functions on $[a,\infty)$ that have at
most $m$ line segments before vanishing. We have:

Theorem 1. Suppose $h$ is measurable and bounded. Consider optimization (1). If it is feasible,
then either

1. An optimal solution $f^*$ exists, where $f^* \in \mathcal{PL}^3_+[a,\infty)$.

2. An optimal solution does not exist. There exists a sequence $\{f^{(k)} \in \mathcal{PL}^3_+[a,\infty) : k \ge 1\}$, each
feasible for (1), such that $\int_a^{\infty} h(x) f^{(k)}(x)\,dx \to Z^*$ as $k \to \infty$, where $Z^*$ is the optimal value of
(1). Moreover, let $\{c_3^{(k)} + d_3^{(k)} x : x \in [y_2^{(k)}, y_3^{(k)}]\}$ be the last line segment of $f^{(k)}$. We have $y_3^{(k)} \to \infty$
and $d_3^{(k)} \to 0$ as $k \to \infty$.

The proof of Theorem 1 is discussed in the next sub-sections. Note that $f^*$ in the first case in
Theorem 1 is a continuous piecewise linear density, and consequently has bounded support. In the
second case, as $k \to \infty$, the sequence $\{f^{(k)} : k \ge 1\}$ has unboundedly increasing support endpoint
($y_3^{(k)} \to \infty$), and its last line segment gets closer and more parallel to the horizontal axis ($d_3^{(k)} \to 0$).
This sequence possesses a pointwise limit, but the limit is not a valid density and has a probability
mass that "escapes" to positive infinity.

Figures 7 and 8 show the tail behaviors for the two cases above. A bounded support density in
the first case possesses the lightest possible tail behavior. The second case, on the other hand, can
be interpreted as an extreme heavy-tail. Compare the sequence $f^{(k)}$ with a given arbitrary density.
Given any fixed large enough $x$ on the real line, as $k$ grows, the decay rate of $f^{(k)}$ at the point $x$ is
eventually slower than that of the given density. Since a slower decay rate is the characteristic of
a fatter tail, the behavior implied by $f^{(k)}$ in a sense captures the heaviest possible tail.
Figure 7: Behavior of an optimal light-tailed extrapolation

Figure 8: Behavior of an element in an optimal heavy-tailed extrapolation sequence


4.3. Main Mathematical Developments 

This section presents the mathematical argument for Theorem 1. This development will also help
construct a solution algorithm in Section 5. We divide the argument into two parts. First we
establish an equivalence of (1) to a moment-constrained optimization problem under a different
probability space. Second, we characterize the solution of this moment-constrained problem, which
can then be converted to the solution of (1).

We define some notations. Let $\mathbb{R}^+$ and $\mathbb{R}^-$ be the non-negative and non-positive real axis. Denote
$\mathcal{P}(\mathcal{M})$ as the set of all probability measures on a measurable space $\mathcal{M}$ equipped with the Borel
$\sigma$-field. Let $S_l = \{(p_1, \ldots, p_l) \in (\mathbb{R}^+)^l : \sum_{i=1}^{l} p_i = 1\}$ be the $l$-dimensional probability simplex. Let
$\delta(\cdot)$ be the Dirac measure. Denote $\mathcal{P}_n(\mathcal{M})$ as the set of all finite support distributions on $\mathcal{M}$ with at
most $n$ support points, i.e. each $P \in \mathcal{P}_n(\mathcal{M})$ has masses $(p_1, p_2, \ldots, p_l) \in S_l$ on points $x_1, \ldots, x_l \in \mathcal{M}$,
where $1 \le l \le n$, defined such that $P = \sum_{i=1}^{l} p_i \delta(x_i)$. For simplicity, since any $P \in \mathcal{P}_n(\mathcal{M})$ can be
represented by the support points $(x_1, \ldots, x_n) \in \mathcal{M}^n$ (some possibly identical) and $(p_1, \ldots, p_n) \in S_n$,
we sometimes write $P \sim (x_1, \ldots, x_n, p_1, \ldots, p_n)$ for a given $P \in \mathcal{P}_n(\mathcal{M})$. Moreover, we use the notation
$E[\cdot]$ to denote the associated expectation under $P$.

For convenience, denote $\mathcal{P}^+ = \mathcal{P}(\mathbb{R}^+)$ as the set of all probability measures concentrated on
$\mathbb{R}^+$, and $\mathcal{P}_n^+ = \mathcal{P}_n(\mathbb{R}^+)$ the corresponding set of measures with at most $n$ support points. The
measurability of $h$ is assumed throughout the rest of the exposition.


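4.3.1. Equivalence to Moment-constrained Optimization We first reformulate (1) as:
Lemma 1. Formulation (1) is equivalent to

$$
\begin{aligned}
\max_{f} \quad & \int_a^{\infty} h(x) f(x)\,dx \\
\text{subject to} \quad & \int_a^{\infty} f(x)\,dx = \beta && \text{(2a)} \\
& f(a) = \eta && \text{(2b)} \\
& f'_+(x) \text{ exists and is non-decreasing and right-continuous for } x \ge a && \text{(2c)} \\
& -\nu \le f'_+(x) \le 0 \text{ for } x \ge a && \text{(2d)} \\
& f'_+(x) \to 0 \text{ as } x \to \infty && \text{(2e)} \\
& f(x) = \int_a^{x} f'_+(t)\,dt + \eta \text{ for } x \ge a && \text{(2f)}
\end{aligned}
$$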

Proof of Lemma 1. The proof uses several elementary results from convex analysis. See
Appendix EC.1 for details. □

As a key step, we show the equivalence of (1) to a moment-constrained program, by identifying
the decision variable as $f'_+(x)$ via a one-to-one map with a probability distribution function. Let

$$H(x) = \int_0^{x} \int_0^{u} h(v + a)\,dv\,du \qquad (3)$$

and

$$\mu = \frac{\eta}{\nu} \quad \text{and} \quad \sigma = \frac{2\beta}{\nu} \qquad (4)$$

where $\mu, \sigma > 0$ since we have assumed $\eta, \nu, \beta > 0$. Our result is:

Theorem 2. Suppose $h$ is bounded. The optimal value of (1) is equal to that of

$$
\begin{aligned}
\max \quad & \nu E[H(X)] \\
\text{subject to} \quad & E[X] = \mu \\
& E[X^2] = \sigma \\
& P \in \mathcal{P}^+
\end{aligned}
\qquad (5)
$$

Here the decision variable is a probability measure $P \in \mathcal{P}^+$, and $E[\cdot]$ is the corresponding expectation.
Moreover, there is a one-to-one correspondence between the feasible solutions to (1) and (5), given
by $f'_+(x + a) = \nu(p(x) - 1)$ for $x \in \mathbb{R}^+$, where $f'_+$ is the right derivative of a feasible solution $f$ of
(1) such that $f(x) = \int_a^{x} f'_+(t)\,dt + \eta$ for $x \ge a$, and $p$ is a probability distribution function that is
associated with a feasible probability measure over $\mathbb{R}^+$ in (5).

Proof of Theorem 2. The key of the proof uses integration by parts and an explicit construction
of a linear transformation between $f'_+$ and a probability distribution function $p$. See Appendix
EC.1 for details. □
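The correspondence in Theorem 2 can be checked numerically on a small example. The sketch below assumes a hypothetical two-point measure and hypothetical values of $\nu$ and $a$; it builds $f(a + x) = \nu E[(X - x)^+]$, which is the density obtained by integrating $f'_+(x + a) = \nu(p(x) - 1)$, and compares $f(a)$, the tail mass, and the objective against $\nu\mu$, $\nu\sigma/2$ and $\nu E[H(X)]$. The helper names and the midpoint quadrature are illustrative choices, not part of the paper.

```python
def density_from_measure(points, masses, nu, a):
    """Return f on [a, inf) with f(a + x) = nu * E[(X - x)^+] for a discrete measure."""
    def f(x):
        return nu * sum(p * max(xi - (x - a), 0.0) for xi, p in zip(points, masses))
    return f


def integrate(g, lo, hi, n=20000):
    """Midpoint rule; accurate here because g is continuous and piecewise linear."""
    step = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * step) for i in range(n)) * step


def H_interval(x, a, b, c):
    """H(x) from (3) when h is the indicator of [b, c], assuming a <= b <= c."""
    s, t = b - a, c - a
    return 0.5 * max(x - s, 0.0) ** 2 - 0.5 * max(x - t, 0.0) ** 2


# Hypothetical inputs for illustration only.
points, masses = (1.0, 3.0), (0.5, 0.5)   # two-point measure on R+
nu, a = 0.1, 2.0
mu = sum(p * x for x, p in zip(points, masses))          # E[X]
sigma = sum(p * x * x for x, p in zip(points, masses))   # E[X^2]

f = density_from_measure(points, masses, nu, a)
eta_check = f(a)                                   # should equal nu * mu
beta_check = integrate(f, a, a + max(points))      # should equal nu * sigma / 2
objective_direct = integrate(f, 3.0, 4.0)          # integral of h * f, h = 1 on [3, 4]
objective_moment = nu * sum(p * H_interval(x, a, 3.0, 4.0) for x, p in zip(points, masses))
```

The two objective values agree up to quadrature error, which is the content of the equivalence for this particular measure.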


4.3.2. Further Reduction and Optimality Characterization Next we characterize the
optimality structure for (5), a generalized moment problem in the form of an infinite-dimensional
linear program. Using existing terminology, we call an optimization program consistent if there
exists a feasible solution, and solvable if there exists an optimal solution.

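For convenience, denote $OPT(\mathcal{D})$ as the program

$$
\begin{aligned}
\max \quad & \nu E[H(X)] \\
\text{subject to} \quad & E[X] = \mu \\
& E[X^2] = \sigma \\
& P \in \mathcal{D}
\end{aligned}
$$

where $H, \mu, \sigma$ are defined in (3) and (4), and $\mathcal{D}$ is a collection of probability measures on $\mathbb{R}$. For
example, program (5) is denoted as $OPT(\mathcal{P}^+)$. Moreover, let $Z(P) = \nu E[H(X)]$ be the objective
function of $OPT(\mathcal{D})$ in terms of $P$. We have: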


Theorem 3. Program (5), or equivalently $OPT(\mathcal{P}^+)$, has the same optimal value as $OPT(\mathcal{P}_3^+)$.

Proof of Theorem 3. Follows from a classical result on the extreme points of moment sets. See
Appendix EC.1. □

Next we derive some properties regarding the optimality of $OPT(\mathcal{P}_3^+)$:
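Proposition 1. Consider $OPT(\mathcal{P}_3^+)$ that is consistent. The optimal value $Z^*$ is either
achieved at some $P^* \in \mathcal{P}_3^+$, or there exists a sequence of feasible $P^{(k)} \in \mathcal{P}_3^+$ such that
$Z(P^{(k)}) \to Z^*$. In the second case, each $P^{(k)} \sim (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)})$ such that either
$(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)}) \to (x_1^*, x_2^*, \infty, p_1^*, p_2^*, 0)$ for some $x_1^*, x_2^* \in \mathbb{R}^+$ (possibly identical) and
$(p_1^*, p_2^*) \in S_2$, or $(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)}) \to (x_1^*, \infty, \infty, 1, 0, 0)$ for some $x_1^* \in \mathbb{R}^+$.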


Proof of Proposition 1. See Appendix EC.1. □

We are now ready to show Theorem 1.

Proof of Theorem 1. Convert the original optimization (1) into (5) by Lemma 1 and Theorem 2.
If (5) is consistent, then, by Theorem 3, its optimal value is attained by the two cases in Proposition
1. Note that any solution $P \in \mathcal{P}_3[0,\infty)$ represented by $(x_1, x_2, x_3, p_1, p_2, p_3)$ (with potentially
identical $x_i$'s) admits one-to-one correspondence with a solution $f$ in (1), via $f'_+(x + a) = \nu(p(x) - 1)$
in Theorem 2, giving

$$
f'_+(x) = \begin{cases}
-\nu & \text{for } a \le x < x_1 + a \\
-\nu(1 - p_1) & \text{for } x_1 + a \le x < x_2 + a \\
-\nu(1 - p_1 - p_2) & \text{for } x_2 + a \le x < x_3 + a \\
0 & \text{for } x_3 + a \le x
\end{cases}
$$

and hence

$$
f(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x < x_1 + a \\
\eta - \nu x_1 - \nu(1 - p_1)(x - a - x_1) & \text{for } x_1 + a \le x < x_2 + a \\
\eta - \nu x_1 - \nu(1 - p_1)(x_2 - x_1) - \nu(1 - p_1 - p_2)(x - a - x_2) & \text{for } x_2 + a \le x < x_3 + a \\
0 & \text{for } x_3 + a \le x
\end{cases}
\qquad (6)
$$

The first case in Proposition 1 thus concludes Part 1 of Theorem 1. In the second case in Proposition
1, $x_3^{(k)} \to \infty$ and $p_3^{(k)} \to 0$ so that $1 - p_1^{(k)} - p_2^{(k)} \to 0$. Using (6), we conclude Part 2 of Theorem 1. □


We close this section with two results. First is on the consistency of programs (1) and (5):

Lemma 2. Program (5) is consistent if and only if $\sigma \ge \mu^2$. Correspondingly, program (1) is
consistent if and only if $\eta^2 \le 2\beta\nu$. When $\sigma = \mu^2$, (5) has only one feasible solution given by $\delta(\mu)$.
Correspondingly, when $\eta^2 = 2\beta\nu$, (1) has only one feasible solution given by $f(x) = \eta - \nu(x - a)$ for
$x \ge a$ (until it reaches zero at $x = a + \mu$, and zero thereafter).

Proof of Lemma 2. See Appendix EC.2. □


Graphically, $\eta^2 > 2\beta\nu$ implies that $\beta$ is smaller than the area under the straight line starting
from the point $(a, \eta)$ down to the x-axis with slope $-\nu$. Hence no convex extrapolation can be
drawn under this condition.
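The consistency condition of Lemma 2 amounts to comparing $\beta$ with the triangle area $\eta^2/(2\nu)$. A minimal sketch follows; the function name, the return labels and the numerical tolerance are assumptions made for illustration. The exponential example uses the facts that, for density $r e^{-rx}$, $f(a) = r e^{-ra}$, $-f'(a) = r^2 e^{-ra}$ and $P(X \ge a) = e^{-ra}$.

```python
import math


def check_consistency(eta, nu, beta, rel_tol=1e-12):
    """Classify program (1) using the triangle-area test eta^2 / (2 nu) versus beta."""
    triangle_area = eta ** 2 / (2.0 * nu)   # area under the line from (a, eta) with slope -nu
    if math.isclose(beta, triangle_area, rel_tol=rel_tol):
        return "unique"       # eta^2 = 2 beta nu: only the straight-line tail is feasible
    if beta < triangle_area:
        return "infeasible"   # eta^2 > 2 beta nu: no convex extrapolation exists
    return "feasible"         # eta^2 < 2 beta nu


# Illustration with an exponential density of rate r at a hypothetical threshold a.
r, a = 1.0, 2.0
eta = r * math.exp(-r * a)
nu = r ** 2 * math.exp(-r * a)
beta = math.exp(-r * a)
exponential_status = check_consistency(eta, nu, beta)
```

For the exponential case the triangle area is exactly half of $\beta$, so the program is feasible with room to spare.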

Next, we show that the boundedness assumption on $h$ is nearly essential, in the sense that any
polynomially growing $h$ leads to an infinite optimal value for (1):

Proposition 2. Suppose $\eta^2 < 2\beta\nu$ and $h(x) = \Omega(x^{\epsilon})$ as $x \to \infty$ for some $\epsilon > 0$. The optimal value
of (1) is $\infty$.

Proof of Proposition 2. The proof explicitly constructs a sequence of feasible solutions that lead
to exploding objective values. See Appendix EC.1. □


5. Optimization Procedure for Quasi-concave Objectives 


This section develops a numerical solution algorithm for our worst-case optimization presented in
Section 4. In building our algorithm, we focus on $h$ that satisfies the following stronger assumption,
which covers many natural scenarios including the two examples in the Introduction.

Assumption 1. The function $h: \mathbb{R} \to \mathbb{R}^+$ is bounded, and is non-decreasing in $[a, c)$ and
non-increasing in $(c, \infty)$ for some constant $a \le c \le \infty$ (i.e. $c$ can possibly be $\infty$).

Assumption 1 implies that $h$ is quasi-concave. The non-negativity of $h$ is assumed without loss of
generality when applied to optimization (1). Because $h$ is bounded, one can always add a sufficiently
large constant, say $C$, to make $h$ non-negative. Note that we have $E[h(X); X \ge a] = E[h(X) +
C; X \ge a] - C P(X \ge a) = E[h(X) + C; X \ge a] - C\beta$, and so one can solve for $E[h(X) + C; X \ge a]$ and
recover $E[h(X); X \ge a]$.

We impose an additional mild regularity assumption: 
Assumption 2. The limit

$$\lambda = \lim_{x \to \infty} \frac{H(x)}{x^2} \qquad (7)$$

where $H$ is defined in (3), exists and is finite.

Note that when $h$ is bounded, $H(x) = O(x^2)$ as $x \to \infty$, and $\limsup_{x \to \infty} H(x)/x^2 < \infty$. The essence
of Assumption 2 is on the existence of the limit.
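To make $H$ and $\lambda$ concrete, the sketch below evaluates (3) in closed form for two indicator targets, using the identity $H(x) = \int_0^x (x - v) h(v + a)\,dv$ obtained by exchanging the order of integration. The threshold and interval endpoints are hypothetical, and the quadrature helper is only a cross-check. For an interval target the ratio $H(x)/x^2$ tends to 0, while for a tail-probability target it tends to 1/2.

```python
def H_interval(x, a, b, c):
    """H(x) from (3) for h = indicator of [b, c], assuming a <= b <= c."""
    s, t = b - a, c - a
    return 0.5 * max(x - s, 0.0) ** 2 - 0.5 * max(x - t, 0.0) ** 2


def H_tail(x, a, b):
    """H(x) from (3) for h = indicator of [b, inf), assuming a <= b."""
    return 0.5 * max(x - (b - a), 0.0) ** 2


def H_numeric(x, h, a, n=2000):
    """Midpoint-rule evaluation of int_0^x (x - v) h(v + a) dv, as a cross-check."""
    step = x / n
    return sum((x - (i + 0.5) * step) * h((i + 0.5) * step + a) for i in range(n)) * step


# Hypothetical setting: threshold a = 3 and target interval [4, 5].
a, b, c = 3.0, 4.0, 5.0


def h_interval(x):
    return 1.0 if b <= x <= c else 0.0


big_x = 1.0e6
ratio_interval = H_interval(big_x, a, b, c) / big_x ** 2   # approaches lambda = 0
ratio_tail = H_tail(big_x, a, b) / big_x ** 2              # approaches lambda = 1/2
```

Evaluating the ratio at one large argument only suggests the limit; it does not prove that the limit in Assumption 2 exists.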

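Under Assumption 2, denote

$$W(x_1) = \nu \left[ \frac{\sigma - \mu^2}{\sigma - 2\mu x_1 + x_1^2} H(x_1) + \frac{(\mu - x_1)^2}{\sigma - 2\mu x_1 + x_1^2} H\!\left( \frac{\sigma - \mu x_1}{\mu - x_1} \right) \right] \qquad (8)$$

with $W(\mu) := \nu(H(\mu) + \lambda(\sigma - \mu^2))$, where $\mu$ and $\sigma$ are defined in (4). We have the following
strengthened version of Theorem 1: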


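Theorem 4. Under Assumption 1,

1. The conclusions of Theorem 1 hold with $\mathcal{PL}^3_+[a,\infty)$ replaced by $\mathcal{PL}^2_+[a,\infty)$.

2. Suppose $\eta^2 < 2\beta\nu$ and Assumption 2 holds additionally. The optimal value of (1) is given by
$\max_{x_1 \in [0,\mu]} W(x_1)$.

3. Suppose $\eta^2 < 2\beta\nu$ and Assumption 2 holds additionally. If there exists $x_1^* \in
\arg\max_{x_1 \in [0,\mu]} W(x_1)$ such that $x_1^* \in [0, \mu)$, then an optimal solution to (1) is given by

$$
f^*(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x \le x_1^* + a \\
\eta - \nu x_1^* - \nu \dfrac{(\mu - x_1^*)^2}{\sigma - 2\mu x_1^* + x_1^{*2}} (x - a - x_1^*) & \text{for } x_1^* + a \le x \le \dfrac{\sigma - \mu x_1^*}{\mu - x_1^*} + a \\
0 & \text{for } \dfrac{\sigma - \mu x_1^*}{\mu - x_1^*} + a < x
\end{cases}
$$

Otherwise, there exists a sequence of feasible solutions $f^{(k)}$ with $\int_a^{\infty} h(x) f^{(k)}(x)\,dx \to Z^*$, where $Z^*$
is the optimal value of (1). Moreover, $f^{(k)} \to f^*$ pointwise where

$$
f^*(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x \le \mu + a \\
0 & \text{for } \mu + a < x
\end{cases}
$$

The second case can occur only when $\lambda > 0$.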
Part 1 of Theorem 4 simplifies the search space of densities in (1) from three to two linear
segments. Because of this simplification, solving (1) reduces to finding the first kink of the optimal
density (or sequence of densities), equivalently the first support point of the reformulation (5). This
can be done by a one-dimensional line search $\max_{x_1 \in [0,\mu]} W(x_1)$ in Part 2 of the theorem.

Part 3 of Theorem 4 describes how to distinguish between the light- and heavy-tail cases in
Theorem 1 by looking at the location of $x_1^*$. The former case occurs when there exists an $x_1^*$ in $[0, \mu)$,
and the latter occurs otherwise. Note that $f^*(x) = 0$, $x > \mu + a$ in the pointwise limit of $f^{(k)}$ in
Part 3 of Theorem 4 is a consequence of the last line segment of $f^{(k)}$ getting increasingly closer
and more parallel to the x-axis.

Algorithm 1 summarizes the procedure for obtaining the optimal value of (1).

Algorithm 1: Procedure for finding the optimal value of (1)
Inputs:

1. The function $h$ that satisfies Assumptions 1 and 2.

2. The parameters $\beta, \eta, \nu > 0$.

Procedure:

1. If $\eta^2 > 2\beta\nu$, there is no feasible solution.

2. If $\eta^2 = 2\beta\nu$, the optimal value is $\nu H(\mu)$.

3. If $\eta^2 < 2\beta\nu$, the optimal value is given by $\max_{x_1 \in [0,\mu]} W(x_1)$.
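Algorithm 1 can be sketched in a few lines. The version below is an illustration under stated assumptions, not the authors' implementation: the caller supplies $H$ and $\lambda$ directly, the line search in Step 3 is replaced by a plain grid over $[0, \mu)$ plus the endpoint value $W(\mu)$, and the function name, grid size and example inputs are hypothetical. A grid search only approximates the maximum of $W$ from below.

```python
import math


def worst_case_value(H, lam, eta, nu, beta, grid=10001):
    """Sketch of Algorithm 1; returns None when program (1) is infeasible."""
    mu, sigma = eta / nu, 2.0 * beta / nu               # moment targets from (4)
    if math.isclose(eta ** 2, 2.0 * beta * nu, rel_tol=1e-12):
        return nu * H(mu)                               # Step 2: unique feasible tail
    if eta ** 2 > 2.0 * beta * nu:
        return None                                     # Step 1: no feasible solution

    def W(x1):                                          # formula (8) for 0 <= x1 < mu
        denom = sigma - 2.0 * mu * x1 + x1 ** 2
        p1 = (sigma - mu ** 2) / denom
        p2 = (mu - x1) ** 2 / denom
        x2 = (sigma - mu * x1) / (mu - x1)
        return nu * (p1 * H(x1) + p2 * H(x2))

    best = nu * (H(mu) + lam * (sigma - mu ** 2))       # W(mu), the heavy-tail endpoint
    for i in range(grid - 1):                           # Step 3: grid over [0, mu)
        best = max(best, W(mu * i / (grid - 1)))
    return best


# Hypothetical example: threshold a = 3, target P(4 < X < 5), so lambda = 0.
def H(x, s=1.0, t=2.0):
    """H from (3) for the indicator of [a + s, a + t]."""
    return 0.5 * max(x - s, 0.0) ** 2 - 0.5 * max(x - t, 0.0) ** 2


example_bound = worst_case_value(H, 0.0, eta=0.2, nu=0.1, beta=0.3)
```

A useful sanity check is the target $h \equiv 1$ on $[a, \infty)$, for which $H(x) = x^2/2$ and $\lambda = 1/2$: every feasible tail then gives the same value $\beta$, so the routine should return $\beta$.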


The rest of this section provides the developments for proving Theorem 4. First we introduce
the following condition:

Assumption 3. $H$ is convex and $H'$ satisfies a convex-concave property, i.e. $H'(x)$ is convex for
$x \in (0, c)$ and concave for $x \in (c, \infty)$, for some $0 \le c \le \infty$.

With Assumption 3, Theorem 3 can be strengthened to:

Proposition 3. Under Assumption 3, $OPT(\mathcal{P}^+)$ has the same optimal value as $OPT(\mathcal{P}_2^+)$.
Proof of Proposition 3. See Appendix EC.2. □

This allows us to focus on one of the support points of $OPT(\mathcal{P}_2^+)$ in the solution scheme, leading
to the following proposition:


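Proposition 4. Under Assumptions 2 and 3, consider $OPT(\mathcal{P}_2^+)$ with $\sigma > \mu^2$ and let $Z^*$ be its
optimal value.

1. If there exists an optimal solution in $\mathcal{P}_2^+$, then this solution has distinct support points and is
represented by $(x_1^*, x_2^*, p_1^*, p_2^*)$, where $x_1^* \in \arg\max_{x_1 \in [0,\mu)} W(x_1)$ and

$$x_2^* = \frac{\sigma - \mu x_1^*}{\mu - x_1^*}, \quad p_1^* = \frac{\sigma - \mu^2}{\sigma - 2\mu x_1^* + x_1^{*2}}, \quad p_2^* = \frac{(\mu - x_1^*)^2}{\sigma - 2\mu x_1^* + x_1^{*2}} \qquad (9)$$

Moreover, $Z^* = \max_{x_1 \in [0,\mu)} W(x_1)$.

2. If there does not exist an optimal solution, then there must exist a sequence $P^{(k)} \sim
(x_1^{(k)}, x_2^{(k)}, p_1^{(k)}, p_2^{(k)}) \to (\mu, \infty, 1, 0)$. Moreover, $Z^* = \nu(H(\mu) + \lambda(\sigma - \mu^2))$.

3. $Z^* = \max_{x_1 \in [0,\mu]} W(x_1)$.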


Proof of Proposition 4. See Appendix EC.2. □
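Formula (9) determines the second support point and both masses once the first support point is fixed, by matching the two moment constraints. The sketch below (function name and test inputs are hypothetical) computes them and lets one confirm that $E[X] = \mu$ and $E[X^2] = \sigma$ hold.

```python
def two_point_solution(x1, mu, sigma):
    """Return (x2, p1, p2) from (9) for a first support point 0 <= x1 < mu."""
    if not (sigma > mu ** 2 and 0.0 <= x1 < mu):
        raise ValueError("need sigma > mu^2 and 0 <= x1 < mu")
    denom = sigma - 2.0 * mu * x1 + x1 ** 2   # equals (sigma - mu^2) + (mu - x1)^2 > 0
    x2 = (sigma - mu * x1) / (mu - x1)
    p1 = (sigma - mu ** 2) / denom
    p2 = (mu - x1) ** 2 / denom
    return x2, p1, p2


# Hypothetical moment targets for illustration.
mu, sigma, x1 = 2.0, 6.0, 1.0
x2, p1, p2 = two_point_solution(x1, mu, sigma)
first_moment = p1 * x1 + p2 * x2
second_moment = p1 * x1 ** 2 + p2 * x2 ** 2
```

As $x_1$ approaches $\mu$, $x_2$ grows without bound while $p_2$ shrinks to zero, which is the heavy-tail sequence described in Part 2 of Proposition 4.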


The following corollary provides a simple sufficient condition for guaranteeing the light-tail case
in the solution scheme:

Corollary 1. Suppose Assumptions 2 and 3 hold and (1) is consistent. An optimal solution for
(1) must exist if $\lambda = 0$.

Proof of Corollary 1. By Lemma 2, consistency of (1) implies $\sigma \ge \mu^2$. By Theorem 2 and
Proposition 3, it suffices to consider the equivalent program $OPT(\mathcal{P}_2^+)$. Suppose $\lambda = 0$. If $\sigma = \mu^2$, then
$\delta(\mu)$ is an optimal solution. If $\sigma > \mu^2$, then by Proposition 4, if there is no optimal solution, its
optimal value must be $\nu(H(\mu) + \lambda(\sigma - \mu^2)) = \nu H(\mu)$, which is attained by $\delta(\mu)$ and leads to a
contradiction (to both the hypotheses of no optimal solution and $\sigma > \mu^2$). □

We are now ready to show Theorem 4.

Proof of Theorem 4. Proof of 1. Assumption 1 implies Assumption 3. By Theorem 2 and
Proposition 3, program (1) has the same optimal value as that of $OPT(\mathcal{P}_2^+)$. Similar to the proof of
Theorem 1, the result follows by noting that any $P \in \mathcal{P}_2^+$ represented by $(x_1, x_2, p_1, p_2)$ (with
potentially identical $x_i$'s), admits one-to-one correspondence with a solution $f$ in (1), via $f'_+(x + a) =
\nu(p(x) - 1)$ in Theorem 2, giving

$$
f'_+(x) = \begin{cases}
-\nu & \text{for } a \le x < x_1 + a \\
-\nu p_2 & \text{for } x_1 + a \le x < x_2 + a \\
0 & \text{for } x_2 + a \le x
\end{cases}
$$

and hence

$$
f(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x < x_1 + a \\
\eta - \nu x_1 - \nu p_2 (x - a - x_1) & \text{for } x_1 + a \le x < x_2 + a \\
0 & \text{for } x_2 + a \le x
\end{cases}
\qquad (10)
$$

Proof of 2. The condition $\eta^2 < 2\beta\nu$ is equivalent to $\sigma > \mu^2$. The conclusion follows from Part 3
in Proposition 4.

Proof of 3. The first case is obtained by substituting $x_1^* \in \arg\max_{x_1 \in [0,\mu)} W(x_1)$ and $x_2^*, p_2^*$
from (9), in Part 1 in Proposition 4, into (10). The second case is obtained by substituting
$(x_1^{(k)}, x_2^{(k)}, p_1^{(k)}, p_2^{(k)})$ in Part 2 in Proposition 4 into (10) and taking the limit. The last conclusion
follows from Corollary 1. □


6. Formulation and Procedure under Data-driven Environment 

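Sections 4 and 5 have discussed our worst-case approach in the abstract setting where the values of
the needed parameters $\beta, \eta, \nu$ are completely known. In practice, these parameters are not directly
specified. Instead, they are calibrated from data in the non-tail region. Suppose we obtain
confidence intervals (CIs) for $P(X \ge a)$ and $f(a)$ and a lower confidence bound for $f'_-(a)$, jointly with
confidence level $1 - \alpha$. Denote them as $[\underline{\beta}, \overline{\beta}]$, $[\underline{\eta}, \overline{\eta}]$ and $-\overline{\nu}$. Suppose $\underline{\beta}, \overline{\beta}, \underline{\eta}, \overline{\eta}, \overline{\nu} > 0$. We substitute
these estimates for the exact values of $\beta$, $\eta$ and $-\nu$ in our worst-case bound for $E[h(X); X \ge a]$:

$$
\begin{aligned}
\max_{f} \quad & \int_a^{\infty} h(x) f(x)\,dx \\
\text{subject to} \quad & \underline{\beta} \le \int_a^{\infty} f(x)\,dx \le \overline{\beta} \\
& \underline{\eta} \le f(a) = f(a+) \le \overline{\eta} \\
& f'_+(a) \ge -\overline{\nu} \\
& f(x) \text{ convex for } x \ge a \\
& f(x) \ge 0 \text{ for } x \ge a
\end{aligned}
\qquad (11)
$$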

It is immediate that the optimal value of (11) carries the following statistical guarantee:

Proposition 5. Suppose that $[\underline{\beta}, \overline{\beta}]$, $[\underline{\eta}, \overline{\eta}]$ and $-\overline{\nu}$ are the joint $(1 - \alpha)$-level CIs for $P(X > a)$
and $f(a)$, and lower confidence bound for $f'_+(a)$. Then with probability at least $1 - \alpha$ (with respect to the
data) optimization (11) gives an upper bound for $E[h(X); X > a]$ under the assumption that $f(x)$
is convex for $x \ge a$ and $f(a) = f(a+)$.

Proof of Proposition 5. Let $f_{true}(x)$, $x \ge a$ be the ground-truth density, and $Z_{true} =
\int_a^\infty h(x) f_{true}(x)\,dx$. Let $Z^*$ and $\mathcal{F}$ be the optimal value and feasible region of (11). If $f_{true} \in \mathcal{F}$, then
$Z^* \ge Z_{true}$. Hence $P_{data}(Z^* \ge Z_{true}) \ge P_{data}(f_{true} \in \mathcal{F}) = 1 - \alpha$, where $P_{data}$ denotes the probability
with respect to the data. □

For $h$ that has support spanning across both $X \le a$ and $X > a$, one approach is to estimate
$E[h(X); X \le a]$ separately from the computation of the worst-case bound from (11). The former
can typically be done by using the empirical mean, as the non-tail region $X \le a$ possesses more
data to rely on. This segregated approach, however, only allows the conditions of valid probability
density on the whole real line (e.g., $\int_{-\infty}^{\infty} f(x)\,dx = 1$) and the continuity at $a$ to hold approximately
but not exactly.

The following result presents the optimality structure for (11), in parallel to the fixed-parameter formulation.

Theorem 5. Suppose $h$ is bounded. Consider optimization (11). If it is feasible, then either
1. An optimal solution $f^*$ exists, where $f^* \in \mathcal{PL}^+[a, \infty)$.
2. An optimal solution does not exist. There exists a sequence $\{f^{(k)} \in \mathcal{PL}^+[a, \infty) : k \ge 1\}$, each
feasible for (11), such that $\int_a^\infty h(x) f^{(k)}(x)\,dx \to Z^*$ as $k \to \infty$, where $Z^*$ is the optimal value of
( [IT| ). Moreover, let {cg^^ + d!'^^x : x G be the last line segment of f^^\ We have Z'oo 

and \ 0 as /c —)• cx). 


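Proof of Theorem 5. See Appendix EC.3. □

Define

$$\underline{\mu} = \frac{\underline{\eta}}{\overline{\nu}}, \quad \overline{\mu} = \frac{\overline{\eta}}{\overline{\nu}}, \quad \underline{\sigma} = \frac{2\underline{\beta}}{\overline{\nu}}, \quad \overline{\sigma} = \frac{2\overline{\beta}}{\overline{\nu}}$$

where $\underline{\mu}, \overline{\mu}, \underline{\sigma}, \overline{\sigma} > 0$ since we have assumed $\underline{\beta}, \overline{\beta}, \underline{\eta}, \overline{\eta}, \overline{\nu} > 0$. Define

$$\mathcal{W}(x_1, \omega, \rho) = \overline{\nu} \left( \frac{\rho - \omega^2}{\rho - 2\omega x_1 + x_1^2} H(x_1) + \frac{(\omega - x_1)^2}{\rho - 2\omega x_1 + x_1^2} H\!\left( \frac{\rho - \omega x_1}{\omega - x_1} \right) \right)$$

with $\mathcal{W}(\omega, \omega, \rho) := \overline{\nu}(H(\omega) + \lambda(\rho - \omega^2))$, where $H$ and $\lambda$ are defined as in the fixed-parameter case.
For convenience, we also denote

$$\mathcal{K}(x; x_1, \omega, \rho) = \begin{cases}
\overline{\nu}\omega - \overline{\nu}(x - a) & \text{for } a \le x \le x_1 + a \\
\overline{\nu}\omega - \overline{\nu} x_1 - \overline{\nu} \frac{(\omega - x_1)^2}{\rho - 2\omega x_1 + x_1^2} (x - a - x_1) & \text{for } x_1 + a \le x \le \frac{\rho - \omega x_1}{\omega - x_1} + a \\
0 & \text{for } \frac{\rho - \omega x_1}{\omega - x_1} + a \le x
\end{cases} \qquad (12)$$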


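Our data-integrated optimization (11) possesses the following consistency property, in parallel to
the fixed-parameter case.

Lemma 3. Program (11) is consistent if and only if $\underline{\eta}^2 \le 2\overline{\beta}\,\overline{\nu}$, or equivalently $\overline{\sigma} \ge \underline{\mu}^2$. When $\underline{\eta}^2 =
2\overline{\beta}\,\overline{\nu}$, or equivalently $\overline{\sigma} = \underline{\mu}^2$, (11) has only one feasible solution, given by $f(x) = \underline{\eta} - \overline{\nu}(x - a)$ for
$a \le x \le \underline{\mu} + a$ and $f(x) = 0$ for $x \ge \underline{\mu} + a$.

Proof of Lemma 3. The proof is similar to that of the fixed-parameter lemma and hence skipped. □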
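To see where the consistency condition comes from, the following short derivation may help. It is only a sketch of the necessity direction, not the skipped proof, and it assumes the notation $\underline{\mu} = \underline{\eta}/\overline{\nu}$ and $\overline{\sigma} = 2\overline{\beta}/\overline{\nu}$ used in this section. Any feasible $f$ lies above its tangent line at $a$ and above zero, so its tail mass is at least the area of a triangle:

$$f(x) \ge \big(f(a) + f'_+(a)(x - a)\big)^+ \ge \big(\underline{\eta} - \overline{\nu}(x - a)\big)^+ \quad \Longrightarrow \quad \overline{\beta} \ge \int_a^\infty f(x)\,dx \ge \frac{\underline{\eta}^2}{2\overline{\nu}}$$

$$\frac{\underline{\eta}^2}{2\overline{\nu}} \le \overline{\beta} \iff \frac{\underline{\eta}^2}{\overline{\nu}^2} \le \frac{2\overline{\beta}}{\overline{\nu}} \iff \underline{\mu}^2 \le \overline{\sigma}$$

When equality holds, the triangle itself is the only function that meets the mass constraint, which matches the single feasible solution stated in the lemma.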

The following provides the solution scheme for our data-integrated optimization (11):

Theorem 6. Under Assumption^ 

1. The conclusions of Theorem^hold with VC'^[a,oo) replaced by ^*£^[0,00). 
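2. Suppose $\underline{\eta}^2 < 2\overline{\beta}\,\overline{\nu}$ and the additional assumption on $h$ holds. The optimal value of (11) is given by

$$\max\left\{ \max_{\rho \in [\underline{\sigma} \vee \overline{\mu}^2,\, \overline{\sigma}],\; x_1 \in [0, \overline{\mu}]} \mathcal{W}(x_1, \overline{\mu}, \rho), \;\; \max_{\omega \in [\underline{\mu},\, \overline{\mu} \wedge \sqrt{\underline{\sigma}}],\; x_1 \in [0, \omega]} \mathcal{W}(x_1, \omega, \underline{\sigma}) \right\} \qquad (13)$$

3. Suppose $\underline{\eta}^2 < 2\overline{\beta}\,\overline{\nu}$ and the additional assumption on $h$ holds. Suppose

$$\max_{\rho \in [\underline{\sigma} \vee \overline{\mu}^2,\, \overline{\sigma}],\; x_1 \in [0, \overline{\mu}]} \mathcal{W}(x_1, \overline{\mu}, \rho) \ge \max_{\omega \in [\underline{\mu},\, \overline{\mu} \wedge \sqrt{\underline{\sigma}}],\; x_1 \in [0, \omega]} \mathcal{W}(x_1, \omega, \underline{\sigma}).$$

If there exists $(x_1^*, \rho^*) \in \arg\max_{\rho \in [\underline{\sigma} \vee \overline{\mu}^2,\, \overline{\sigma}],\; x_1 \in [0, \overline{\mu}]} \mathcal{W}(x_1, \overline{\mu}, \rho)$ such that $x_1^* \in [0, \overline{\mu})$, then an optimal solution to (11) is
given by $f^*(x) = \mathcal{K}(x; x_1^*, \overline{\mu}, \rho^*)$. Otherwise, there exists a sequence of feasible solutions $f^{(k)}$ with
$\int_a^\infty h(x) f^{(k)}(x)\,dx \to Z^*$, the optimal value of (11), such that $f^{(k)} \to f^*$ pointwise, where

$$f^*(x) = \begin{cases} \overline{\eta} - \overline{\nu}(x - a) & \text{for } a \le x \le \overline{\mu} + a \\ 0 & \text{for } \overline{\mu} + a \le x \end{cases}$$

which can occur only when $\lambda > 0$. On the other hand, suppose

$$\max_{\rho \in [\underline{\sigma} \vee \overline{\mu}^2,\, \overline{\sigma}],\; x_1 \in [0, \overline{\mu}]} \mathcal{W}(x_1, \overline{\mu}, \rho) < \max_{\omega \in [\underline{\mu},\, \overline{\mu} \wedge \sqrt{\underline{\sigma}}],\; x_1 \in [0, \omega]} \mathcal{W}(x_1, \omega, \underline{\sigma}).$$

If there exists $(x_1^*, \omega^*) \in \arg\max_{\omega \in [\underline{\mu},\, \overline{\mu} \wedge \sqrt{\underline{\sigma}}],\; x_1 \in [0, \omega]} \mathcal{W}(x_1, \omega, \underline{\sigma})$ such
that $x_1^* \in [0, \omega^*)$, then an optimal solution to (11) is given by $f^*(x) = \mathcal{K}(x; x_1^*, \omega^*, \underline{\sigma})$. Otherwise,
there exists a sequence of feasible solutions $f^{(k)}$ with $\int_a^\infty h(x) f^{(k)}(x)\,dx \to Z^*$, such that $f^{(k)} \to f^*$
pointwise, where

$$f^*(x) = \begin{cases} \overline{\nu}\omega^* - \overline{\nu}(x - a) & \text{for } a \le x \le \omega^* + a \\ 0 & \text{for } \omega^* + a \le x \end{cases}$$

which again can occur only when $\lambda > 0$.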


Proof of Theorem 6. Optimization (13) follows from a reduction of the inequality-based generalized
moment problem converted from (11) into two subproblems. Appendix EC.3 provides the
constituent propositions and further details. □

Algorithm 2 presents our procedure for solving (11).


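Algorithm 2: Procedure for Finding the Optimal Value of (11)
Inputs:

1. The function $h$ that satisfies the required assumptions.

2. The parameters $\underline{\beta}, \overline{\beta}, \underline{\eta}, \overline{\eta}, \overline{\nu} > 0$.

Procedure:

1. If $\underline{\eta}^2 > 2\overline{\beta}\,\overline{\nu}$, there is no feasible solution.

2. If $\underline{\eta}^2 = 2\overline{\beta}\,\overline{\nu}$, the optimal value is $\int_a^\infty h(x) f(x)\,dx$ evaluated at the single feasible solution $f$ of Lemma 3.

3. If $\underline{\eta}^2 < 2\overline{\beta}\,\overline{\nu}$, the optimal value is

$$\max\left\{ \max_{\rho \in [\underline{\sigma} \vee \overline{\mu}^2,\, \overline{\sigma}],\; x_1 \in [0, \overline{\mu}]} \mathcal{W}(x_1, \overline{\mu}, \rho), \;\; \max_{\omega \in [\underline{\mu},\, \overline{\mu} \wedge \sqrt{\underline{\sigma}}],\; x_1 \in [0, \omega]} \mathcal{W}(x_1, \omega, \underline{\sigma}) \right\}$$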
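As a rough numerical illustration of the procedure, the Python sketch below approximates the bound for an interval probability, $h(x) = I(c \le x \le d)$ with $a \le c < d$. It is not the authors' R implementation. It assumes $H(x) = \int_0^x \int_0^u h(v + a)\,dv\,du$ and $\lambda = \limsup_{x \to \infty} H(x)/x^2$ (so $\lambda = 0$ for this bounded-support $h$), and it assumes that for each admissible pair $(\omega, \rho)$ the two-point value $\mathcal{W}(x_1, \omega, \rho)$ is the relevant objective. Instead of the two boundary subproblems in step 3, it runs a plain grid search over all admissible $(\omega, \rho, x_1)$, which is slower and only approximates the maximum from below. All function names and the grid size are illustrative.

```python
import math

def make_H_interval(a, c, d):
    """H(x) = int_0^x int_0^u h(v + a) dv du for h(x) = 1{c <= x <= d}, a <= c < d.
    (Assumed definition of H; closed form for the indicator.)"""
    lo, hi = c - a, d - a
    def H(x):
        if x <= lo:
            return 0.0
        if x <= hi:
            return 0.5 * (x - lo) ** 2
        return 0.5 * (hi - lo) ** 2 + (hi - lo) * (x - hi)
    return H

def W(x1, omega, rho, nu_bar, H, lam=0.0):
    """Two-point objective: mass p1 at x1 and p2 at x2, matching first moment
    omega and second moment rho. The case x1 == omega uses the limiting value."""
    gap = omega - x1
    if gap <= 1e-12 * max(omega, 1.0):
        return nu_bar * (H(omega) + lam * (rho - omega ** 2))
    spread = max(rho - omega ** 2, 0.0)
    denom = spread + gap ** 2            # equals rho - 2*omega*x1 + x1**2
    p1 = spread / denom
    p2 = gap ** 2 / denom
    x2 = (rho - omega * x1) / gap
    return nu_bar * (p1 * H(x1) + p2 * H(x2))

def worst_case_bound(beta_lo, beta_hi, eta_lo, eta_hi, nu_bar, H, lam=0.0, n=40):
    """Grid approximation of the optimal value; returns None if infeasible."""
    mu_lo, mu_hi = eta_lo / nu_bar, eta_hi / nu_bar
    sig_lo, sig_hi = 2.0 * beta_lo / nu_bar, 2.0 * beta_hi / nu_bar
    if mu_lo ** 2 > sig_hi:              # consistency check of Lemma 3
        return None
    best = -math.inf
    for i in range(n + 1):
        omega = mu_lo + (mu_hi - mu_lo) * i / n
        rho_min = max(sig_lo, omega ** 2)    # second moment cannot be below omega^2
        if rho_min > sig_hi:
            continue
        for j in range(n + 1):
            rho = rho_min + (sig_hi - rho_min) * j / n
            for k in range(n + 1):
                x1 = omega * k / n
                best = max(best, W(x1, omega, rho, nu_bar, H, lam))
    return best
```

For instance, with an Exp(1) body matched at $a = -\log(0.7)$ (so that all three parameters equal 0.7), `worst_case_bound(0.7, 0.7, 0.7, 0.7, 0.7, make_H_interval(a, 1.5, 2.5))` should dominate the exponential value $e^{-1.5} - e^{-2.5}$, and widening the intervals can only increase the result.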


7. Numerical Examples 

We present the numerical performance of our algorithms. We first consider several elementary
examples, and then we revisit the two examples in the Introduction. All computations are
performed using the R software. For transparency and reproducibility purposes, the codes are made
available on GitHub at https://github.com/cmottet/RobustTailNumerics.


7.1. Elementary Examples 


We consider three examples to demonstrate Algorithm 1.


Entropic Risk Measure: The entropic risk measure (e.g., Föllmer and Schied (2011)) captures
the risk aversion of users through the exponential utility function. It is defined as

$$\rho(X) = \frac{1}{\theta} \log\left( E\left[ e^{-\theta X} \right] \right) \qquad (14)$$

where $\theta > 0$ is the parameter of risk aversion. In the case when the distribution of the random
variable $X$ is known only up to some point $a$, we can find the worst-case value of the entropic risk
measure subject to tail uncertainty by solving the optimization problem

$$\max_{P \in \mathcal{A}} \frac{1}{\theta} \log\left( E\left[ e^{-\theta X} \right] \right) = \frac{1}{\theta} \log\left( E\left[ e^{-\theta X}; X \le a \right] + \max_{P \in \mathcal{A}} E\left[ e^{-\theta X}; X > a \right] \right) \qquad (15)$$

where $\mathcal{A}$ denotes the set of convex tails that match the given non-tail region. Since the function
$e^{-\theta x}$ satisfies the required assumption, we can apply Algorithm 1 to the second term on the RHS of (15).
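The decomposition in (15) can be checked numerically. The Python sketch below does so for a standard exponential variable, for which both pieces are available in closed form. It only illustrates the split into a known body term and a tail term; the worst-case tail value itself would come from Algorithm 1, which is not implemented here, and the function names are illustrative.

```python
import math

def entropic_risk_exp1(theta):
    """(1/theta) * log E[exp(-theta X)] for X ~ Exp(1), using E[exp(-theta X)] = 1/(1+theta)."""
    return math.log(1.0 / (1.0 + theta)) / theta

def body_term_exp1(theta, a):
    """E[exp(-theta X); X <= a] for X ~ Exp(1)."""
    return (1.0 - math.exp(-(1.0 + theta) * a)) / (1.0 + theta)

def tail_term_exp1(theta, a):
    """E[exp(-theta X); X > a] for X ~ Exp(1)."""
    return math.exp(-(1.0 + theta) * a) / (1.0 + theta)

def entropic_risk_from_parts(theta, body, tail):
    """Right-hand side of (15): plug any value of the tail term into the log."""
    return math.log(body + tail) / theta
```

Because the logarithm is increasing, replacing the tail term by an upper bound on it yields an upper bound on the risk measure, which is why only the tail expectation needs to be maximized.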


The thick line in Figure 9 represents the worst-case value of the entropic risk measure for different
values of the parameter $\theta$ in the case when $X$ is known to have a standard exponential distribution
Exp(1) up to $a = -\log(0.7)$ (i.e. $a$ is the 70-percentile and $\beta = \eta = \nu = 0.7$). For comparison,
we also calculate and plot the entropic risk measure for several fitted probability distributions:
Exp(1), two-segment continuous piecewise linear tail denoted as 2-PLT (two such instances in
Figure 9), and mixtures of 2-PLT and shifted Pareto. Clearly, the worst-case values bound those
calculated from the candidate parametric models, with the gap diminishing as $\theta$ increases.



Figure 9 Optimal upper bound and comparison with parametric extrapolations for the entropic risk 
measure. 


The Newsvendor Problem: The classical newsvendor problem maximizes the profit of selling
a perishable product by fulfilling demand using a stock level decision, i.e.,

$$\max_q \; E[p \min(q, D)] - cq \qquad (16)$$

where $D$ is the demand random variable, $p$ and $c$ are the selling and purchase prices per product,
and $q$ is the stock quantity to be determined. We assume that $p > c$. The optimal solution to (16)
is given by Littlewood's rule $q^* = F^{-1}((p - c)/p)$, where $F^{-1}$ is the quantile function of $D$ (Talluri
and Van Ryzin (2006)).
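As a small worked example of Littlewood's rule, the Python sketch below evaluates $q^* = F^{-1}((p - c)/p)$ when $D$ is fully lognormal. It assumes that a stated mean and standard deviation refer to $D$ itself (not to $\log D$), and it does not involve the robust formulation; the function names are illustrative.

```python
import math
from statistics import NormalDist

def lognormal_params(mean, sd):
    """Log-scale parameters (mu, sigma) of a lognormal with the given mean and sd."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)

def lognormal_cdf(x, mean, sd):
    """P(D <= x) for lognormal demand D."""
    mu, sigma = lognormal_params(mean, sd)
    return NormalDist().cdf((math.log(x) - mu) / sigma)

def newsvendor_quantity(p, c, mean, sd):
    """Littlewood's rule: the (p - c)/p quantile of the demand distribution."""
    mu, sigma = lognormal_params(mean, sd)
    z = NormalDist().inv_cdf((p - c) / p)
    return math.exp(mu + sigma * z)
```

With `p = 7` and `c = 1`, the rule targets the 6/7 quantile of demand, so `lognormal_cdf(newsvendor_quantity(7, 1, mean, sd), mean, sd)` returns 6/7 up to rounding.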

Suppose the distribution of $D$ is only known to have the shape of a lognormal distribution
with mean 50 and standard deviation 20 in the interval $[0, a)$, where $a$ is the 70-percentile of the
lognormal distribution. A robust optimization formulation for (16) is

$$\max_q \min_{P \in \mathcal{A}} E[p \min(q, D)] - cq = \max_q \left\{ E[p \min(q, D); D \le a] + \min_{P \in \mathcal{A}} E[p \min(q, D); D > a] - cq \right\} \qquad (17)$$

where $\mathcal{A}$ denotes the set of convex tails that match the given non-tail region. The outer optimization
in (17) is a concave program. We concentrate on the inner optimization. Since $p \min(q, D)$ is a
non-decreasing function in $D$ on $[0, \infty)$, its negation is non-increasing, and the required assumption holds (note
that minimization here can be achieved by merely maximizing the negation). We can therefore
apply Algorithm 1 (with $\beta = 0.7$, $\eta \approx 0.007$, and $\nu \approx 0.0003$). Figure 10 shows the optimal lower
bound of the inner optimization when $p = 7$, $c = 1$ and $q$ varies between 0 and 193.26 (which is
the 95-percentile of the lognormal distribution). The curve peaks at $q = 55.7$, which is the solution
to problem (17). As a comparison, we also show different candidate values of the expectation that
are obtained by fitting the tails of lognormal, 2-PLT (two instances) and mixture of shifted Pareto
and 2-PLT (see Figure 10).

Figure 10 Optimal objective values of the inner optimization of the robust newsvendor problem.

Tail Interval Probability: Consider estimating probabilities of the type $P(c < X < d)$. We
compare the bound provided by Algorithm 1 with the “truth” that $X$ is a Pareto distribution
with tail index 1, i.e. $P(X > x) = 1/x$ for all $x \ge 1$, for different values of the threshold $a$ and
interval $(c, d)$. On the heat map given in Figure 11, the color of each rectangle represents the ratio
between the computed upper bound and the true probability for some threshold $a$ and interval
$(c, d)$. The x-value on the left hand side of each rectangle indicates the value of $a$ (in percentile of
the Pareto distribution) and the lower and upper y-values of the rectangle represent the interval
$(c, d)$ (also in percentile of the Pareto distribution). Each rectangle represents an area that has
Pareto probability mass exactly 0.01. We see that, for instance, when $a$ is the 70th percentile and
$(c, d)$ is the (85th, 86th)-percentile, then our bound is roughly two times that of Pareto, whereas
$(c, d)$ = (98th, 99th)-percentile gives roughly eight times. Figure 11 confirms the intuition that the
smaller the distance between $a$ and $c$, the less conservative the bound is, and hence the closer it is to
the “truth”.


7.2. Synthetic Data: Example Revisited 

Consider the synthetic data set of size 200 from the Introduction. This data set is actually generated from
a lognormal distribution with parameters $(\mu, \sigma) = (0, 0.5)$, but we assume that only the data are
available to us. We are interested in the quantity $P(4 < X < 5)$, and for this we will solve program
(11) to generate an upper bound that is valid with 95% confidence.

Figure 11 Ratio between the worst-case upper bound and the Pareto distribution for the quantity $P(c < X < d)$ for different thresholds $a$ and intervals $(c, d)$.

We compute the interval estimates for $\beta$, $\eta$ and $\nu$ as follows. First, we obtain point estimates for
these parameters through a standard kernel density estimator (KDE) in the R statistical package.
To obtain interval estimates, we run 1,000 bootstrap resamples and take the appropriate quantiles
of the 1,000 resampled point estimates. To account for the fact that three parameters are
estimated simultaneously, we apply a Bonferroni correction, so that the confidence level used for each
individual estimator is $1 - 0.05/3$.
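The bootstrap-plus-Bonferroni recipe above can be sketched in a few lines of Python. This is not the authors' R code. It assumes a Gaussian kernel with a Silverman-type bandwidth fixed from the original sample, percentile bootstrap intervals, two-sided intervals for $\beta$ and $\eta$, and a one-sided upper bound for $\nu$ (equivalently, a lower bound for the density derivative). The function names and defaults are illustrative.

```python
import math
import random
import statistics

SQRT2PI = math.sqrt(2.0 * math.pi)

def kde_tail_params(sample, a, bw):
    """Gaussian-KDE estimates of beta = P(X > a), eta = f(a), nu = -f'(a)."""
    beta = eta = nu = 0.0
    for x in sample:
        z = (a - x) / bw
        phi = math.exp(-0.5 * z * z) / SQRT2PI
        beta += 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z)
        eta += phi / bw
        nu += z * phi / bw ** 2                      # minus the derivative of the KDE
    n = len(sample)
    return beta / n, eta / n, nu / n

def quantile(sorted_vals, p):
    """Simple order-statistic quantile of an already sorted list."""
    idx = min(len(sorted_vals) - 1, max(0, math.ceil(p * len(sorted_vals)) - 1))
    return sorted_vals[idx]

def bonferroni_bootstrap(sample, a, level=0.95, n_boot=1000, seed=0):
    """Joint interval estimates; each parameter gets error budget (1 - level) / 3."""
    rng = random.Random(seed)
    n = len(sample)
    bw = 1.06 * statistics.stdev(sample) * n ** (-0.2)  # assumed bandwidth rule
    alpha_each = (1.0 - level) / 3.0
    draws = [kde_tail_params(rng.choices(sample, k=n), a, bw) for _ in range(n_boot)]
    betas, etas, nus = (sorted(col) for col in zip(*draws))
    return {
        "beta": (quantile(betas, alpha_each / 2), quantile(betas, 1 - alpha_each / 2)),
        "eta": (quantile(etas, alpha_each / 2), quantile(etas, 1 - alpha_each / 2)),
        "nu_upper": quantile(nus, 1 - alpha_each),
    }
```

The returned intervals play the roles of $[\underline{\beta}, \overline{\beta}]$, $[\underline{\eta}, \overline{\eta}]$ and $\overline{\nu}$ in the data-driven formulation. Bandwidth choice matters considerably for derivative estimation, so the rule used here should be read as a placeholder.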


For a sense of how to choose $a$, Figure 12 shows the density and density derivative estimates
and compares them to those of the lognormal distribution. The KDE suggests that convexity holds
starting from around $x = 1.5$ (the point where the density derivative estimate starts to turn from a
decreasing to an increasing function). Thus, it is reasonable to confine the choice of $a$ to be larger
than 1.5. In fact, this number is quite close to the true inflexion point 1.15.

Since the data become progressively scanter as $x$ grows larger, and the KDE is designed to utilize
neighborhood data, the interval estimators for the necessary parameters $\beta$, $\eta$ and $\nu$ become less
reliable for larger choices of $a$. For instance, Figure 12 shows that the bootstrapped KDE CI of
the density derivative covers the truth only up to $x = 3.1$. In general, a good choice of $a$ should
be located at a point where there are some data in the neighborhood of $a$, such that the interval
estimators for $\beta$, $\eta$ and $\nu$ are reliable, but as large as possible, because choosing a small $a$ can make
the tail extrapolation bound more conservative.


[Figure 12: three panels titled "Density derivative function", "Density function" and "Tail distribution function", each plotted for x from 1 to 5; legend: bootstrap 95% CI, true function.]

Figure 12 Bootstrapped kernel estimation of the distribution, density and density derivative for the synthetic
data.


As a first attempt, we run Algorithm 2 using $a = 3.1$ to estimate an upper bound for the
probability $P(4 < X < 5)$, which gives $8.8 \times 10^{-3}$ while the truth is $2.1 \times 10^{-3}$. Thus, this estimated
upper bound does cover the truth and also has the same order of magnitude. We perform the
following two other procedures for comparison:


1. GPD approach: As discussed in Section 3.1, this is a common approach for tail modeling. Fit
the data above a threshold $u$ to the density function

$$(1 - \hat{F}(u))\, g_{\xi,\beta}(x - u)$$

where $\hat{F}(u)$ is the estimated ECDF at $u$, and $g_{\xi,\beta}(\cdot)$ is the GPD density, whose distribution function
is defined as

$$G_{\xi,\beta}(x) = \begin{cases} 1 - (1 + \xi x/\beta)^{-1/\xi} & \text{if } \xi \ne 0 \\ 1 - \exp(-x/\beta) & \text{if } \xi = 0 \end{cases}$$

for $x \ge 0$, and $\beta > 0$. Set the threshold $u$ to be 1.8, the point at which a linear trend begins to be
observed on the mean excess plot of the data, as recommended by McNeil (1997). Estimate $F(u)$
by the sample mean of $I(X_i \le u)$, where $I(\cdot)$ denotes the indicator function. Obtain the parameter
estimates $\hat{\xi}$ and $\hat{\beta}$ using the maximum likelihood estimator suggested by Smith (1987). Then use
the delta method to obtain a 95% CI of the quantity $P(c < X < d)$.


2. Worst-case approach with known parameter values: Assume $\beta$, $\eta$ and $\nu$ are known at $a = 3.1$.
Then run Algorithm 1 to obtain the upper bound.
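For the GPD approach in the first procedure, the point estimate of an interval probability follows directly from the fitted tail. The Python sketch below shows that computation only. The maximum likelihood fit and the delta-method interval are not shown, and the function names and parameter values are illustrative.

```python
import math

def gpd_cdf(x, xi, beta):
    """Distribution function of the generalized Pareto distribution on x >= 0."""
    if x < 0:
        return 0.0
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-x / beta)
    base = 1.0 + xi * x / beta
    if base <= 0.0:          # beyond the upper endpoint when xi < 0
        return 1.0
    return 1.0 - base ** (-1.0 / xi)

def gpd_interval_prob(c, d, u, F_u, xi, beta):
    """P(c < X < d) for u <= c < d under the fitted tail (1 - F_u) * GPD(x - u)."""
    return (1.0 - F_u) * (gpd_cdf(d - u, xi, beta) - gpd_cdf(c - u, xi, beta))
```

As a sanity check, a Pareto variable with $P(X > x) = 1/x$ has excesses over $u$ that are exactly GPD with $\xi = 1$ and scale $u$, so `gpd_interval_prob(4, 5, 2, 0.5, 1.0, 2.0)` reproduces $1/4 - 1/5$.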

Table 1 shows the upper bounds obtained from the above approaches, and also shows the obvious
fact that using ECDF alone for estimating $P(4 < X < 5)$ gives 0 since there are no data in the
interval [4,5]. The 95% Cl output by CPD fit is [—8.72 x 10“"^, 1.10 x 10“^], which does not bound 
the truth (note that this is a two-sided interval, and the upper bound would be off even more if
it had been one-sided). The worst-case approach with known parameters gives an upper bound
of $3.16 \times 10^{-3}$, which is less conservative than the case when the parameters are estimated. The
difference between these numbers can be interpreted as the price of estimation for $\beta$, $\eta$ and $\nu$.
For this particular setup, the worst-case approach correctly covers the true value, whereas GPD
fitting gives an invalid upper bound, which suggests that either the data size or the threshold level
is insufficient to support a good fit of the GPD. This is an instance where the worst-case approach
has outperformed GPD in terms of correctness.

Given that the worst-case approach with estimated parameters appears conceivably more
conservative than with known parameters, we conduct a sensitivity study using only Algorithm 1.
The first row in Table 2 shows the upper bound output by Algorithm 1 using the point estimates
of the parameters $\beta$, $\eta$, $\nu$. The other rows in Table 2 show the outputs of Algorithm 1 when some
Method                               Estimated upper bound
Truth                                2.14E-03
ECDF                                 0.00E+00
GPD                                  1.11E-03
Worst-case with known parameters     3.16E-03
Worst-case approach                  8.80E-03

Table 1 Estimated upper bounds of the probability P(4 < X < 5) for the synthetic data example.

values of the parameters are changed to the upper or lower estimates of the 95% CIs. Some scenarios are
omitted in the table because they lead to infeasibility. We see that among all these scenarios, the
most conservative upper bound occurs when $\beta$, $\eta$, $\nu$ are all set to be the upper estimates, giving
$8.67 \times 10^{-3}$, which is very close to the output of Algorithm 2. Note that some of these bounds do not cover
the truth, which necessitates the use of the interval approach and Algorithm 2.

The above discussion focuses only on the realization of one data set, which raises the question
of whether it holds more generally. Therefore, we obtain an empirical probability of coverage by
repeating the following procedure 100 times:

1. Generate a lognormal sample of size 200 with parameters $(\mu, \sigma) = (0, 0.5)$;

2. Estimate $\underline{\eta}$, $\overline{\eta}$, $\underline{\beta}$, $\overline{\beta}$ and $\overline{\nu}$ at a chosen point $a$ (see below);

3. Use Algorithm 2 to compute the worst-case upper bound of $P(c < X < d)$.

We then estimate the coverage probability of our worst-case upper bound as the proportion of
times that Algorithm 2 yields a bound that dominates the true probability $P(c < X < d)$. We
repeat this procedure for different $[c, d]$ varying from $[4, 5]$ to $[9, 10]$, and for two different values of
$a$ given by 3.1 and 2.8. Tables 3 and 4 show the true probabilities, the mean upper bounds from
the 100 experiments, and the empirical coverage probabilities.

The coverage probabilities in Tables 3 and 4 are mostly 1, which suggests that our procedure is
conservative. For $a = 3.1$ and intervals that are close to $a$, i.e. $[c, d] = [4, 5]$ and $[5, 6]$, the coverage
probability is not 1 but rather is close to the prescribed confidence level of 95%. Further
investigation reveals that our procedure fails to cover the truth only in the case when the joint CI of the
β                  η                  ν                  Worst-case upper bound
Estimated value    Estimated value    Estimated value    2.04E-03
Estimated value    Lower estimate     Estimated value    5.76E-06
Estimated value    Lower estimate     Upper estimate     5.76E-06
Upper estimate     Lower estimate     Estimated value    5.76E-06
Upper estimate     Lower estimate     Upper estimate     5.76E-06
Estimated value    Upper estimate     Estimated value    3.61E-04
Estimated value    Upper estimate     Upper estimate     1.62E-03
Estimated value    Estimated value    Upper estimate     2.05E-03
Upper estimate     Estimated value    Upper estimate     5.53E-03
Upper estimate     Estimated value    Estimated value    5.53E-03
Upper estimate     Upper estimate     Estimated value    8.30E-03
Upper estimate     Upper estimate     Upper estimate     8.67E-03

Table 2 Sensitivity analysis of the worst-case upper bound of P(4 < X < 5) for the synthetic data example,
generated by Algorithm 1, when β, η, ν are changed from the point estimates to the upper estimates of the 95% CIs.


c    d     Truth       Mean upper bound    Coverage probability
4    5     2.14E-03    1.03E-02            0.94
5    6     4.74E-04    6.12E-03            0.99
6    7     1.20E-04    4.33E-03            1.00
7    8     3.38E-05    3.35E-03            1.00
8    9     1.04E-05    2.74E-03            1.00
9    10    3.49E-06    2.31E-03            1.00

Table 3 Mean upper bounds and empirical coverage probabilities using worst-case approach with threshold a = 3.1.
c    d     Truth       Mean upper bound    Coverage probability
4    5     2.14E-03    1.31E-02            1.00
5    6     4.74E-04    8.26E-03            1.00
6    7     1.20E-04    6.04E-03            1.00
7    8     3.38E-05    4.76E-03            1.00
8    9     1.04E-05    3.92E-03            1.00
9    10    3.49E-06    3.34E-03            1.00

Table 4 Mean upper bounds and empirical coverage probabilities using worst-case approach with threshold a = 2.8.

parameters $\eta$, $\beta$ and $\nu$ does not contain the true values, which is consistent with the rationale of
our method. Although we have not tried lower values of $a$, it is very likely that in those settings
the coverage probabilities will stay mostly 1, and the mean upper bounds will increase since the
level of conservativeness increases.

As a comparison, Table 5 shows the results of GPD fit using the threshold $u = 1.8$. Here, all of
the coverage probabilities are far from the prescribed level of 95%, which suggests that either GPD
is the wrong parametric choice to use since the threshold is not high enough, or that the estimation
error of its parameters is too large due to the lack of data. (Again, we have used a two-sided 95%
CI for the GPD approach here; if we had used a one-sided upper confidence bound, then the upper
bounding value would be even lower and the coverage probability would drop further.) However,
the mean upper bounds using GPD fit do cover the truth in all cases. Since the coverage probability
is well below 95%, this suggests that the estimation of GPD parameters is highly sensitive to the
realization of data.

In summary, Tables 3, 4 and 5 show the pros and cons of our worst-case approach and GPD
fitting. GPD is on average closer to the true target quantity, but its confidence upper bound can fall
short of the prescribed coverage probability (in fact, it covers the truth only between 37% and 62% of the time
in Table 5). On the other hand, our approach gives a reasonably tight upper bound when
c    d     Truth       Mean upper bound    Coverage probability
4    5     2.14E-03    3.87E-03            0.62
5    6     4.74E-04    1.27E-03            0.53
6    7     1.20E-04    5.48E-04            0.51
7    8     3.38E-05    2.79E-04            0.43
8    9     1.04E-05    1.62E-04            0.40
9    10    3.49E-06    1.03E-04            0.37

Table 5 Mean upper bounds and empirical coverage probabilities using GPD.

the interval in consideration (i.e. [c, d]) is close to the threshold a, and tends to be more conservative 
far out. This is a drawback, but sensibly so, given that the uncertainty of extrapolation increases 
as it gets farther away from what is known. 

Both our worst-case approach and GPD fitting require choosing a threshold parameter. In GPD 
fitting, it is important to choose a threshold parameter high enough so that the GPD becomes a 
valid model. GPD fitting, however, is difficult for a small data set when the lack of data prohibits 
choosing a high threshold. On the other hand, the threshold in our worst-case approach can be 
chosen much higher, because our method relies on the data below the threshold, not above it. 


7.3. Fire Insurance Data: Example Revisited 

Consider the fire insurance data from the Introduction. The quantity of interest is the expected payoff of a
high-excess policy with reinsurance, given by $h(x) = (x - 50) I(50 \le x \le 200) + 150\, I(x > 200)$. The
data set has only seven observations above 50.

We apply our worst-case approach to estimate an upper bound for the expected payoff by using
$a = 29.03$, the cutoff above which 15 observations are available. Similar to Section 7.2, we use the
bootstrapped KDE to obtain CIs for $\beta$, $\eta$ and $\nu$. The estimates in Figure 13 appear to be very
stable for this example, thanks to the relatively large data size.

We run Algorithm 2 to obtain a 95% confidence upper bound of 1.99. For comparison, we fit a
GPD using threshold $u = 10$, which follows McNeil (1997) as the choice that roughly balances the
bias-variance tradeoff. The 95% CI from GPD fit is $[-0.03, 0.23]$. Thus, the worst-case approach
gives an upper bound that is one order of magnitude higher, a finding that resonates with that in
Section 7.2. Our recommendation is that a modeler who cares only about the order of magnitude
would be better off choosing GPD, whereas a more risk-averse modeler who wants a bound on the
risk quantity with high probability guarantee would be better off choosing the worst-case approach.

[Figure 13: panel titled "Tail distribution function"; remaining panels not recoverable.]

Figure 13 Bootstrapped kernel estimation of the distribution, density and density derivative for the Danish
fire losses data.

8. Conclusion 


This paper proposed a worst-case, nonparametric approach to bound tail quantities based on the 
tail convexity assumption. The approach relied on an optimization formulated over all possible 
tail densities. We characterized the optimality structure of this infinite-dimensional optimization 
problem by developing an equivalence to a moment-constrained problem. Under additional
quasi-concavity condition on the objective function, we constructed the numerical solution scheme by
converting it into low-dimensional nonlinear programs. With the presence of data, this approach
tractably generated statistically valid bounds via suitable relaxations of the optimization that took
into account the estimation errors of the required parameters. We compared our proposed approach
to existing tail-fitting techniques, and demonstrated its relative strength of outputting correct tail 
estimates under data-deficient environments. We also examined the level of conservativeness of our 
bounds, which was viewed as a limitation of the proposed approach. 

We suggest two extensions of our research. First is to generalize the proposed method to
multivariate distributions, perhaps through separate modeling on the marginal distributions and the
dependency structure. Second is to study means to reduce the level of conservativeness. This can 
involve mathematical transformations of the variable and the addition of extra information (e.g., 
other constraints). 

References 

Balkema, August A, Laurens De Haan. 1974. Residual life time at great age. The Annals of Probability 
792-804. 

Beirlant, Jan, Jozef L Teugels. 1992. Modeling large claims in non-life insurance. Insurance: Mathematics 
and Economics 11(1) 17-29. 

Ben-Tal, Aharon, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, Gijs Rennen. 2013. Robust 
solutions of optimization problems affected by uncertain probabilities. Management Science 59(2) 
341-357. 

Bertsimas, Dimitris, Ioana Popescu. 2005. Optimal inequalities in probability theory: A convex optimization
approach. SIAM Journal on Optimization 15(3) 780-804.

Birge, John R, Jose H Dula. 1991. Bounding separable recourse functions with limited distribution
information. Annals of Operations Research 30(1) 277-298.

Birge, John R, Roger J-B Wets. 1987. Computing bounds for stochastic programming problems by means 
of a generalized moment problem. Mathematics of Operations Research 12(1) 149-162. 

Cule, Madeleine, Richard Samworth, Michael Stewart. 2010. Maximum likelihood estimation of a
multi-dimensional log-concave density. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 72(5) 545-607.



Lam and Mottet: Worst-case Tail Analysis 

Article submitted to Operations Research-, manuscript no. (Please, provide the manuscript number!) 


35 


Davis, Richard, Sidney Resnick. 1984. Tail estimates motivated by extreme value theory. The Annals of 
Statistics 1467-1487. 

Davison, Anthony C, Richard L Smith. 1990. Models for exceedances over high thresholds. Journal of the 
Royal Statistical Society. Series B (Methodological) 393-442. 

Delage, Erick, Yinyu Ye. 2010. Distributionally robust optimization under moment uncertainty with
application to data-driven problems. Operations Research 58(3) 595-612.

El Ghaoui, Laurent, A Nilim. 2005. Robust solutions to Markov decision problems with uncertain transition
matrices. Operations Research 53(5).

Embrechts, Paul, Rüdiger Frey, Alexander McNeil. 2005. Quantitative risk management. Princeton Series in
Finance, Princeton 10.

Embrechts, Paul, Claudia Klüppelberg, Thomas Mikosch. 1997. Modelling extremal events, vol. 33. Springer
Science & Business Media.

Fisher, Ronald Aylmer, Leonard Henry Caleb Tippett. 1928. Limiting forms of the frequency distribution of 
the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical 
Society, vol. 24. Cambridge University Press, 180-190. 

Föllmer, Hans, Alexander Schied. 2011. Stochastic finance: an introduction in discrete time. Walter de
Gruyter.

Glasserman, Paul, Wanmo Kang, Perwez Shahabuddin. 2007. Large deviations in multifactor portfolio credit 
risk. Mathematical Finance 17(3) 345-379. 

Glasserman, Paul, Wanmo Kang, Perwez Shahabuddin. 2008. Fast simulation of multifactor portfolio credit 
risk. Operations Research 56(5) 1200-1217. 

Glasserman, Paul, Jingyi Li. 2005. Importance sampling for portfolio credit risk. Management Science 
51(11) 1643-1656. 

Gnedenko, Boris. 1943. Sur la distribution limite du terme maximum d'une série aléatoire. Annals of
Mathematics 423-453.

Goh, Joel, Melvyn Sim. 2010. Distributionally robust optimization and its tractable approximations.
Operations Research 58(4-part-1) 902-917.





Gumbel, Emil Julius. 2012. Statistics of Extremes. Courier Corporation. 

Hannah, Lauren A, David B Dunson. 2013. Multivariate convex regression with adaptive partitioning. The
Journal of Machine Learning Research 14(1) 3261-3294.

Hansen, Lars Peter, Thomas J Sargent. 2008. Robustness. Princeton University Press. 

Heidelberger, Philip. 1995. Fast simulation of rare events in queueing and reliability models. ACM
Transactions on Modeling and Computer Simulation (TOMACS) 5(1) 43-85.

Hill, Bruce M, et al. 1975. A simple general approach to inference about the tail of a distribution. The 
Annals of Statistics 3(5) 1163-1174. 

Hogg, Robert V, Stuart A Klugman. 2009. Loss Distributions, vol. 249. John Wiley & Sons. 

Hosking, Jonathan RM, James R Wallis. 1987. Parameter and quantile estimation for the generalized pareto 
distribution. Technometrics 29(3) 339-349. 

Iyengar, Garud N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2) 257-280.

Karr, Alan F. 1983. Extreme points of certain sets of probability measures, with applications. Mathematics 
of Operations Research 8(1) 74-85. 

Koenker, Roger, Ivan Mizera. 2010. Quasi-concave density estimation. The Annals of Statistics 2998-3027. 

Lim, Eunji, Peter W Glynn. 2012. Consistency of multidimensional convex regression. Operations Research
60(1) 196-208.

McNeil, A. J. 1997. Estimating the tails of loss severity distributions using extreme value theory. The Journal 
of the International Actuarial Association 27 117-137. 

Nicola, Victor F, Marvin K Nakayama, Philip Heidelberger, Ambuj Goyal. 1993. Fast simulation of highly
dependable systems with general failure and repair processes. IEEE Transactions on Computers 42(12)
1440-1452.

Petersen, Ian R, Matthew R James, Paul Dupuis. 2000. Minimax optimal control of stochastic uncertain 
systems with relative entropy constraints. IEEE Transactions on Automatic Control 45(3) 398-412. 

Pickands III, James. 1975. Statistical inference using extreme order statistics. The Annals of Statistics
119-131.





Popescu, Ioana. 2005. A semidefinite programming approach to optimal-moment bounds for convex classes
of distributions. Mathematics of Operations Research 30(3) 632-657.

Rockafellar, Ralph Tyrrell. 2015. Convex analysis. Princeton University Press.

Seijo, Emilio, Bodhisattva Sen, et al. 2011. Nonparametric least squares estimation of a multivariate convex 
regression function. The Annals of Statistics 39(3) 1633-1657. 

Seregin, Arseni, Jon A Wellner. 2010. Nonparametric estimation of multivariate convex-transformed densities.
The Annals of Statistics 38(6) 3751.

Smith, James E. 1995. Generalized chebychev inequalities: theory and applications in decision analysis. 
Operations Research 43(5) 807-825. 

Smith, Richard L. 1985. Maximum likelihood estimation in a class of nonregular cases. Biometrika 72(1) 
67-90. 

Smith, Richard L. 1987. Estimating tails of probability distributions. The Annals of Statistics 1174-1207.

Talluri, Kalyan T, Garrett J Van Ryzin. 2006. The theory and practice of revenue management, vol. 68. 
Springer Science & Business Media. 


Winkler, Gerhard. 1988. Extreme points of moment sets. Mathematics of Operations Research 13(4) 581-587. 



e-companion to Lam and Mottet: Worst-case Tail Analysis 




Appendix 

EC.1. Proofs for Section 4 

We need several results from convex analysis to prove the lemma. For any convex function $g$ on $\mathbb{R}$, let $\operatorname{dom} g = \{x \in \mathbb{R} : g(x) < \infty\}$ be its effective domain. The following theorems are from Rockafellar (2015), specialized to convex functions $g$ with $\operatorname{dom} g = \mathbb{R}$. 


Theorem EC.1 (a.k.a. Rockafellar (2015), Corollary 10.1.1). A convex function finite on $\mathbb{R}$ is necessarily continuous. 

Theorem EC.2 (a.k.a. Rockafellar (2015), Theorem 24.1). Let $g$ be a closed proper convex function on $\mathbb{R}$ such that $\operatorname{dom} g = \mathbb{R}$. Then the right derivative $g'_+$ exists and is a finite non-decreasing function on $\mathbb{R}$. Moreover, $g'_+$ is right-continuous, i.e., $\lim_{z \downarrow x} g'_+(z) = g'_+(x)$ for any $x \in \mathbb{R}$. 

Theorem EC.3 (a.k.a. Rockafellar (2015), Corollary 24.2.1). Let $g$ be a finite convex function on a non-empty open real interval $I$. Then 

\[ g(y) - g(x) = \int_x^y g'_+(t)\,dt \]

for any $x$ and $y$ in $I$. 

Theorem EC.4 (a.k.a. Rockafellar (2015), Theorem 24.2). Let $\varphi$ be a non-decreasing function from $\mathbb{R}$ to $[-\infty, \infty]$ such that $\varphi(b)$ is finite for some $b \in \mathbb{R}$. Then the function given by 

\[ g(x) = \int_b^x \varphi(t)\,dt \]

is a well-defined closed proper convex function on $\mathbb{R}$. 

Proof of Lemma [7[ Throughout this proof, without loss of generality let a = 0 (by replacing 
$f(x)$ with $f(x + a)$, and $h(x)$ with $h(x + a)$ respectively). Note that optimizations (1) and (2) do not depend on $f(x)$ for $x < 0$. For the purpose of applying Theorems EC.1 to EC.4 more directly, let us extend $f$ to $\mathbb{R}^-$ by defining $f(x) = \eta - \nu x$ for $x < 0$ (this extension of $f$ is a mathematical artifact and does not necessarily match the given true density). 

Let $\mathcal{F}_1$ be the feasible region in (1), and $\mathcal{F}_2$ be the feasible region in (2). We show that $\mathcal{F}_1 = \mathcal{F}_2$. 

Proof of $\mathcal{F}_1 \subseteq \mathcal{F}_2$: Since $f(x) < \infty$ for at least one $x \in \mathbb{R}$ (e.g., take $x = 0$) and $f(x) \ge 0 > -\infty$ for all $x \in \mathbb{R}$, we get that $f$ is proper (Rockafellar (2015), p.24). 

Next, we argue that $f(x) < \infty$ for all $x \in \mathbb{R}$. Suppose on the contrary that $f(x_0) = \infty$ for some $x_0 > 0$. If $f(y) < \infty$ for some $y > x_0$, then 

\[ \frac{y - x_0}{y} f(0) + \frac{x_0}{y} f(y) = \frac{y - x_0}{y}\,\eta + \frac{x_0}{y} f(y) < \infty = f(x_0), \]

contradicting (1d). But if $f(y) = \infty$ for all $y > x_0$, then $\int_0^\infty f(t)\,dt = \infty$, contradicting (1a). Therefore, $f(x) < \infty$ for all $x \in \mathbb{R}$, and with (1e), we conclude that $f$ is finite. 

Since $f$ is proper, closedness is the same as lower semi-continuity (Rockafellar (2015), p.52). Since $f$ is finite on $\mathbb{R}$, Theorem EC.1 implies that $f$ is continuous. Hence $f$ is closed. 

Therefore, together with the convexity condition in (1d), Theorem EC.2 implies the existence of $f'_+$ that satisfies (2c). Moreover, Theorem EC.3 implies (2f). 


Next, with the monotonicity of $f'_+$ by (2c), we have $f'_+(x) \ge f'_+(0) \ge -\nu$ for all $x \ge 0$, thus implying the first inequality of (2d). To prove the second inequality in (2d), suppose on the contrary that $f'_+(x_0) > 0$ for some $x_0 > 0$. Since $f'_+(x) \ge f'_+(x_0) > 0$ for all $x \ge x_0$ by (2c), we have, from (2f), $f(x) = \int_0^x f'_+(t)\,dt + \eta \to \infty$ as $x \to \infty$, implying that $\int_0^\infty f(x)\,dx = \infty$ and contradicting (1a). Hence the second inequality in (2d) holds. We have therefore shown (2d). 

Lastly, suppose that $f'_+(x) \not\to 0$. Then, since (2d) holds, there exists a sequence $x_k \to \infty$ such that $f'_+(x_k) \to c < 0$. But since $f'_+$ is monotone by (2c), $\lim_{x \to \infty} f'_+(x)$ exists and must equal $c$. But then, from (2f), $f(x) = \int_0^x f'_+(t)\,dt + \eta \to -\infty$, violating (1e). Thus (2e) holds. 

The constraints (2a) and (2b) follow immediately from (1a) and (1b). We therefore conclude that $\mathcal{F}_1 \subseteq \mathcal{F}_2$. 


Proof of $\mathcal{F}_2 \subseteq \mathcal{F}_1$: Since $f'_+$ is bounded on $\mathbb{R}^+$ by (2d), Theorem EC.4 and (2c) (with $f'_+(x)$ defined as $-\nu$ for $x < 0$) imply that the $f$ defined by (2f) is convex on $\mathbb{R}$, giving (1d). 

Suppose $f(x_0) < 0$ for some $x_0 > 0$. Then, since $f'_+ \le 0$ by (2d), (2f) implies $f(x) \le f(x_0) < 0$ for all $x \ge x_0$. Thus $\int_0^\infty f(x)\,dx = -\infty$, contradicting (2a). Therefore, (1e) holds. 

The constraint (2d) implies (1c) immediately. The condition (2f) implies $f(0) = f(0+)$. Thus, combining with (2b), we get that (1b) holds. Finally, note that (2a) is the same as (1a). We conclude that $\mathcal{F}_2 \subseteq \mathcal{F}_1$. □ 

Proof of the theorem. Throughout this proof, without loss of generality let $a = 0$. For convenience, we let $\tilde{H}(x) = \int_0^x h(u)\,du$ and $H(x) = \int_0^x \tilde{H}(u)\,du$. Consider the objective function of (2). Since $\tilde{H}$ is continuous and $f$ is absolutely continuous with $f(x) = \int_0^x f'_+(t)\,dt + \eta$ by (2f), we have, using integration by parts, 

\[ \int_0^\infty h(x) f(x)\,dx = \tilde{H}(x) f(x)\Big|_0^\infty - \int_0^\infty \tilde{H}(x) f'_+(x)\,dx = -\int_0^\infty \tilde{H}(x) f'_+(x)\,dx \tag{EC.1} \]

where the second equality follows from Lemma EC.1 (presented next) and that $\tilde{H}(x) = O(x)$ as $x \to \infty$ since $h$ is bounded. As $H$ is continuous and $f'_+$ has bounded variation by (2d) and (2c), we have, using integration by parts again, that (EC.1) is equal to 

\[ -H(x) f'_+(x)\Big|_0^\infty + \int_0^\infty H(x)\,df'_+(x) = \int_0^\infty H(x)\,df'_+(x) \tag{EC.2} \]

where the equality follows from Lemma EC.1 and that $H(x) = O(x^2)$ as $x \to \infty$ since $h$ is bounded. 


For (2a), we can write 

\[ \int_0^\infty f(x)\,dx = \int_0^\infty \frac{x^2}{2}\,df'_+(x) \tag{EC.3} \]

by merely viewing $h \equiv 1$ in (EC.1) and (EC.2). Also, since $f(x) \to 0$ as $x \to \infty$ by Lemma EC.1, we can use integration by parts again to write 

\[ f(0) = -\int_0^\infty f'_+(x)\,dx = -x f'_+(x)\Big|_0^\infty + \int_0^\infty x\,df'_+(x) = \int_0^\infty x\,df'_+(x) \tag{EC.4} \]

where the third equality follows from Lemma EC.1 again. Therefore, (2) can be written as (letting $a = 0$) 

\[
\begin{array}{ll}
\max_f & \int_0^\infty H(x)\,df'_+(x) \\
\text{subject to} & \int_0^\infty \frac{x^2}{2}\,df'_+(x) = \beta \\
& \int_0^\infty x\,df'_+(x) = \eta \\
& f'_+(x) \text{ exists and is non-decreasing and right-continuous for } x \ge 0 \\
& -\nu \le f'_+(x) \le 0 \text{ for } x \ge 0 \\
& f'_+(x) \to 0 \text{ as } x \to \infty
\end{array}
\tag{EC.5}
\]

and the last constraint (2f) in (2) states that $f$ can be recovered from $f(x) = \int_0^x f'_+(t)\,dt + \eta$. Note that this definition of $f$ must necessarily have a right derivative coinciding with the obtained $f'_+(x)$. 


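Finally, let $p(x) = f'_+(x)/\nu + 1$. Then (EC.5) can be rewritten as 

\[
\begin{array}{ll}
\max_p & \nu \int_0^\infty H(x)\,dp(x) \\
\text{subject to} & \int_0^\infty x^2\,dp(x) = \frac{2\beta}{\nu} \\
& \int_0^\infty x\,dp(x) = \frac{\eta}{\nu} \\
& p(x) \text{ non-decreasing and right-continuous for } x \ge 0 \\
& 0 \le p(x) \le 1 \text{ for } x \ge 0 \\
& p(x) \to 1 \text{ as } x \to \infty
\end{array}
\tag{EC.6}
\]

or equivalently 

\[
\begin{array}{ll}
\max_p & \nu \int_{-\infty}^\infty H(x)\,dp(x) \\
\text{subject to} & \int_{-\infty}^\infty x^2\,dp(x) = \frac{2\beta}{\nu} \\
& \int_{-\infty}^\infty x\,dp(x) = \frac{\eta}{\nu} \\
& p(x) \text{ non-decreasing and right-continuous for } x \in \mathbb{R} \\
& 0 \le p(x) \le 1 \text{ for } x \in \mathbb{R} \\
& p(x) \to 1 \text{ as } x \to \infty \\
& p(x) = 0 \text{ for } x < 0
\end{array}
\tag{EC.7}
\]

since $H(x) = x = x^2 = 0$ at $x = 0$. One can uniquely identify, up to measure zero, a non-decreasing, right-continuous $p$ such that $\lim_{x \to \infty} p(x) = 1$ and $p(x) = 0$ for $x < 0$ with a probability measure supported on $\mathbb{R}^+$. Hence (EC.7) is equivalent to the moment problem stated in the theorem. This concludes the result. □ 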
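
As a sanity check on the change of variables $p(x) = f'_+(x)/\nu + 1$, the following Python sketch works through one illustrative case that is not taken from the paper: $a = 0$, $f(x) = e^{-x}$ (so that $\eta = \nu = \beta = 1$), and the bounded choice $h(x) = \mathbf{1}\{x > 1\}$. Under these assumptions $p$ is the Exp(1) distribution function, and the script checks numerically that $E[X] = \eta/\nu$, $E[X^2] = 2\beta/\nu$ and $\int_0^\infty h f\,dx = \nu E[H(X)]$. The helper `simpson` and the truncation point `UPPER` are ad hoc choices made for this illustration.

```python
import math


def simpson(func, lo, hi, n=60_000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    step = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * step)
    return total * step / 3


# Illustrative tail density with a = 0: f(x) = exp(-x).
f = lambda x: math.exp(-x)
eta, nu, beta = 1.0, 1.0, 1.0   # f(0), -f'_+(0), and the integral of f

# p(x) = f'_+(x)/nu + 1 = 1 - exp(-x) is the Exp(1) distribution function,
# whose density is exp(-x).
density = lambda x: math.exp(-x)

# Assumed bounded h(x) = 1{x > 1}; its double antiderivative is H below.
H = lambda x: 0.5 * max(x - 1.0, 0.0) ** 2

UPPER = 60.0  # truncation point; the neglected tail mass is of order exp(-60)

mean = simpson(lambda x: x * density(x), 0.0, UPPER)
second_moment = simpson(lambda x: x * x * density(x), 0.0, UPPER)
lhs = simpson(f, 1.0, UPPER)                          # integral of h*f, since h*f = f on (1, inf)
rhs = nu * simpson(lambda x: H(x) * density(x), 0.0, UPPER)

print(mean, second_moment, lhs, rhs)
```

In this example both sides of the objective identity equal $e^{-1}$, which can also be verified by hand.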

Lemma EC.1. If $f$ is a feasible solution of (1), equivalently (2), then $x f(x) \to 0$ and $x^2 f'_+(x) \to 0$ as $x \to \infty$. 

Proof of Lemma EC.1. We need the observations that $f(x)$ is non-increasing by (2d), $f(x) \ge 0$ for all $x \ge a$ by (1e), and that $f$ is integrable on $[a, \infty)$ with $\int_a^\infty f(x)\,dx = \beta$ by (1a). Denote $F(x) = \int_a^x f(t)\,dt$ and $g(x) = x f(x) - F(x)$. Consider, for $a \vee 0 \le x_1 \le x_2$, 

\[
\begin{aligned}
g(x_2) - g(x_1) &= x_2 f(x_2) - x_1 f(x_1) - (F(x_2) - F(x_1)) \\
&\le x_2 f(x_2) - x_1 f(x_1) - f(x_2)(x_2 - x_1) \quad \text{since } f \text{ is non-increasing} \\
&= x_1 [f(x_2) - f(x_1)] \\
&\le 0 \quad \text{again since } f \text{ is non-increasing}
\end{aligned}
\]

Therefore $g$ is non-increasing for $x \ge a \vee 0$, and since $x f(x) \ge 0$ and $0 \le F(x) \le \beta$ for $x \ge a \vee 0$, we have $g$ bounded from below on the same range. This implies that $g$ must converge to a limit, say $c$, as $x \to \infty$. In other words, $x f(x) - F(x) \to c$, and since $F(x) \to \beta$, we have $x f(x) \to c + \beta$. Since $x f(x) \ge 0$ for $x \ge a \vee 0$, there are two cases: $c + \beta > 0$ or $c + \beta = 0$. The first case implies that $x f(x) \ge \epsilon > 0$ for some $\epsilon$ for all large enough $x$. This means $f(x) \ge \epsilon / x$ for all large enough $x$, and hence $\int_a^\infty f(x)\,dx = \infty$, which contradicts (1a). Therefore $x f(x)$ must converge to 0. This proves the first part of the lemma. 

To prove the second part, we need the observation that $f'_+(x)$ is non-decreasing for $x \ge a$ by (2c), and is non-positive for $x \ge a$ by (2d). Also, by (2f) we have $f(x) = \int_a^x f'_+(t)\,dt + \eta$ for $x \ge a$. Let $\bar{F}(x) = \int_x^\infty f(t)\,dt$ for $x \ge a$, which is finite and converges to 0 by (1a). We now define $g(x) = -x^2 f'_+(x) + 2\tilde{F}(x)$, where $\tilde{F}(x) = -\int_x^\infty t f'_+(t)\,dt$, for $x \ge a$. Note that $x f'_+(x)$ is integrable on $[a, \infty)$ because of the absolute continuity of $f$ and because $x f(x) \to 0$ as we have just proved, which allows integration by parts yielding 

\[ \tilde{F}(x) = -\int_x^\infty t f'_+(t)\,dt = -t f(t)\Big|_x^\infty + \int_x^\infty f(t)\,dt = x f(x) + \bar{F}(x) < \infty \tag{EC.8} \]

For any $(a \vee 0) \le x_1 \le x_2$, 

\[
\begin{aligned}
g(x_2) - g(x_1) &= x_1^2 f'_+(x_1) - x_2^2 f'_+(x_2) - 2\tilde{F}(x_1) + 2\tilde{F}(x_2) \\
&\le x_1^2 f'_+(x_1) - x_2^2 f'_+(x_2) + f'_+(x_2)(x_2^2 - x_1^2) \quad \text{since } f'_+ \text{ is non-decreasing} \\
&= x_1^2 (f'_+(x_1) - f'_+(x_2)) \\
&\le 0 \quad \text{again since } f'_+ \text{ is non-decreasing}
\end{aligned}
\]

Therefore, $g(x)$ is non-increasing for $x \ge a \vee 0$. Note that $-x^2 f'_+(x) \ge 0$ for $x \ge a$. Also, from (EC.8), since $x f(x) \to 0$ and $\bar{F}(x) \to 0$, we have $\tilde{F}(x) \to 0$ as $x \to \infty$, and hence $\tilde{F}$ is also bounded for large enough $x$. Therefore $g$ is bounded from below. This implies that $g$ must converge to a limit, say $c$, as $x \to \infty$. Since $\tilde{F}(x) \to 0$, we have $-x^2 f'_+(x) \to c$. Since $-x^2 f'_+(x) \ge 0$ for $x \ge a$, there are two cases: either $c > 0$ or $c = 0$. The former case implies that $-x f'_+(x) \ge \epsilon / x$ for some $\epsilon > 0$ and large enough $x$, and so $\tilde{F}(x) = -\int_x^\infty t f'_+(t)\,dt = \infty$ for $x \ge a$, which contradicts (EC.8). Therefore $x^2 f'_+(x) \to 0$. This proves the second part of the lemma. □ 


To prove Theorem]^ we need several results from Winkler (]1988) stated below. 


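Theorem EC.5 (Winkler (1988) Theorem 2.1(b)). Let $X$ be a measurable space with $\sigma$-field $\mathcal{F}$ and suppose that $\mathcal{P}$ is a simplex of probability measures whose extreme points are Dirac measures. Fix measurable functions $f_1, \ldots, f_n$ and real numbers $c_1, \ldots, c_n$. Consider the set 

\[ \mathcal{H} = \left\{ q \in \mathcal{P} : f_i \text{ is } q\text{-integrable and } \int f_i\,dq = c_i,\ 1 \le i \le n \right\} \]

Then $\mathcal{H}$ is convex and 

\[
\operatorname{ex} \mathcal{H} = \Big\{ q \in \mathcal{H} : q = \sum_{i=1}^m t_i\,\delta(x_i),\ t_i > 0,\ \sum_{i=1}^m t_i = 1,\ x_i \in X,\ 1 \le m \le n+1, \text{ the vectors } (f_1(x_i), \ldots, f_n(x_i), 1),\ 1 \le i \le m, \text{ are linearly independent} \Big\}
\]

where $\operatorname{ex} \mathcal{H}$ denotes the set of all extreme points of $\mathcal{H}$. 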


Theorem EC.6 (Adapted from Winkler (1988) Theorem 3.2). Let $X$ be a Hausdorff space, $\mathcal{F}$ be the Borel $\sigma$-field and $\mathcal{P}_r(X)$ be the set of regular probability measures on $X$. Let 

\[ \mathcal{H} = \left\{ q \in \mathcal{P}_r(X) : f_i \text{ is } q\text{-integrable and } \int f_i\,dq = c_i,\ 1 \le i \le n \right\} \]

Every measure affine functional $J$ on $\mathcal{H}$ fulfills 

\[ \sup\{ J(q) : q \in \mathcal{H} \} = \sup\{ J(q) : q \in \operatorname{ex} \mathcal{H} \} \]

Theorem EC.6 is precisely Theorem 3.2 in Winkler (1988), except replacing the inequalities with equalities for the moments that define $\mathcal{H}$, which is immediate (and is pointed out by the comment right after the theorem in Winkler (1988)). 

Proposition EC.1 (Winkler (1988) Proposition 3.1). Let $X$, $\mathcal{F}$ and $\mathcal{H}$ be given as in Theorem EC.6 and the function $g$ on $X$ be integrable for every $q \in \mathcal{H}$ (possibly with integral values $\infty$ or $-\infty$). Then the functional $G$ on $\mathcal{H}$ defined by $G(q) = \int_X g\,dq$ is measure affine. 


Proof of the theorem. By Examples 2.1(a) in Winkler (1988), the set $\mathcal{P}$ in Theorem EC.5 can be chosen to be the set of all regular probability measures. On a Polish space every probability measure is regular. Therefore, on the space $\mathbb{R}^+$, which is Polish, we can take $\mathcal{P}$ in Theorem EC.5 as the set of all probability measures. The $\mathcal{H}$ in Theorems EC.5 and EC.6 then coincide. By Proposition EC.1, the objective $\nu E[H(X)]$ in $OPT(\mathcal{P}^+)$ is measure affine. Therefore, using Theorems EC.5 and EC.6, and noting that 

\[
\mathcal{H} \supseteq \Big\{ q \in \mathcal{H} : q = \sum_{i=1}^m t_i\,\delta(x_i),\ t_i \ge 0,\ \sum_{i=1}^m t_i = 1,\ x_i \in X,\ 1 \le m \le n+1 \Big\} \supseteq \operatorname{ex} \mathcal{H}
\]

for the coincided $\mathcal{H}$ in Theorems EC.5 and EC.6, we conclude the theorem. □ 


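Proof of the proposition. If the program is consistent, then by the preceding theorem, either an optimal solution in $\mathcal{P}_3^+$ exists, which corresponds to the first case of the lemma, or there exists a feasible sequence $P^{(k)} \in \mathcal{P}_3^+$ such that $Z(P^{(k)}) \to Z^*$. Let $P^{(k)} \sim (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)})$. Suppose that the $x_i^{(k)}$'s are all bounded above by a number, say $M$. Then, since $[0, M]^3 \times \mathcal{S}_3$ is a compact set, by the Bolzano-Weierstrass Theorem we must have a subsequence of $(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)})$, say $(x_1^{(k_j)}, x_2^{(k_j)}, x_3^{(k_j)}, p_1^{(k_j)}, p_2^{(k_j)}, p_3^{(k_j)})$, that converges to $(x_1^*, x_2^*, x_3^*, p_1^*, p_2^*, p_3^*)$ in $[0, M]^3 \times \mathcal{S}_3$. Since $H(x)$ is continuous by construction, we have $Z(P^{(k_j)}) = \nu \sum_{i=1}^3 H(x_i^{(k_j)}) p_i^{(k_j)} \to \nu \sum_{i=1}^3 H(x_i^*) p_i^* = Z(P^*)$, where $P^* \sim (x_1^*, x_2^*, x_3^*, p_1^*, p_2^*, p_3^*)$. As $Z(P^{(k_j)})$ is a subsequence of $Z(P^{(k)})$, $Z(P^*)$ must be equal to $Z^*$, and so $P^*$ is an optimal solution, which reduces to the first case in the lemma. Therefore, for the second case, we should focus on the scenario that at least one $x_i^{(k)}$ satisfies $\limsup_{k \to \infty} x_i^{(k)} = \infty$. 

Without loss of generality, we fix the convention that $x_1^{(k)} \le x_2^{(k)} \le x_3^{(k)}$. If at least one of the $x_i^{(k)}$ satisfies $\limsup_{k \to \infty} x_i^{(k)} = \infty$, we must have $\limsup_{k \to \infty} x_3^{(k)} = \infty$. In order that $P^{(k)}$ is feasible, $\sum_{i=1}^3 p_i^{(k)} x_i^{(k)} = \mu$ holds and so $x_1^{(k)} \le \mu$ for all $k$. We now distinguish two cases: either $x_2^{(k)}$ is uniformly bounded, say by a large number $M \ge \mu$, or $\limsup_{k \to \infty} x_2^{(k)} = \infty$ also. 

Consider the first case. First, we find a subsequence $k_j$ such that $x_3^{(k_j)} \to \infty$. Since $(x_1^{(k_j)}, x_2^{(k_j)}) \in [0, M]^2$, which is compact, we can choose a further subsequence $k_j'$ such that $(x_1^{(k_j')}, x_2^{(k_j')}, x_3^{(k_j')}) \to (x_1^*, x_2^*, \infty)$ where $(x_1^*, x_2^*) \in [0, M]^2$. Now, since $(p_1^{(k_j')}, p_2^{(k_j')}, p_3^{(k_j')}) \in \mathcal{S}_3$, which is compact, we can choose a further subsequence $k_j''$ such that $(p_1^{(k_j'')}, p_2^{(k_j'')}, p_3^{(k_j'')}) \to (p_1^*, p_2^*, p_3^*) \in \mathcal{S}_3$. Note that by the constraint $p_1 x_1^2 + p_2 x_2^2 + p_3 x_3^2 = \sigma$ (with all quantities indexed by $k_j''$), we must have 

\[ p_3^{(k_j'')} = \frac{\sigma - p_1^{(k_j'')} (x_1^{(k_j'')})^2 - p_2^{(k_j'')} (x_2^{(k_j'')})^2}{(x_3^{(k_j'')})^2} \le \frac{\sigma}{(x_3^{(k_j'')})^2} \to 0 \]

In conclusion, in this case, we end up being able to find a sequence of measures $P^{(k)'} \sim (x_1^{(k)'}, x_2^{(k)'}, x_3^{(k)'}, p_1^{(k)'}, p_2^{(k)'}, p_3^{(k)'})$ with $(x_1^{(k)'}, x_2^{(k)'}, x_3^{(k)'}, p_1^{(k)'}, p_2^{(k)'}, p_3^{(k)'}) \to (x_1^*, x_2^*, \infty, p_1^*, p_2^*, 0)$ where $x_1^*, x_2^* \in \mathbb{R}^+$ and $(p_1^*, p_2^*) \in \mathcal{S}_2$. 

For the second case, namely when $\limsup_{k \to \infty} x_i^{(k)} = \infty$ for both $i = 2$ and $3$, we can argue similarly that there is a sequence of measures $P^{(k)'} \sim (x_1^{(k)'}, x_2^{(k)'}, x_3^{(k)'}, p_1^{(k)'}, p_2^{(k)'}, p_3^{(k)'})$ such that $x_2^{(k)'}, x_3^{(k)'} \to \infty$ and $p_2^{(k)'}, p_3^{(k)'} \to 0$. In other words, $(x_1^{(k)'}, x_2^{(k)'}, x_3^{(k)'}, p_1^{(k)'}, p_2^{(k)'}, p_3^{(k)'}) \to (x_1^*, \infty, \infty, 1, 0, 0)$ where $x_1^* \in \mathbb{R}^+$. □ 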

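Proof of the lemma. It follows from Jensen's inequality that for any $P \in \mathcal{P}^+$, $E[X^2] \ge E[X]^2$, which gives $\sigma \ge \mu^2$ in the moment program. On the other hand, if $\sigma \ge \mu^2$, it is also rudimentary to find a finitely supported $P \in \mathcal{P}^+$ with $E[X] = \mu$ and $E[X^2] = \sigma$. Substituting $\mu = \eta/\nu$ and $\sigma = 2\beta/\nu$, we get $\eta^2 \le 2\beta\nu$. Lastly, $E[X^2] = E[X]^2$ if and only if $P$ is a point mass. The equivalent statements regarding program (1) follow from the equivalence between (1) and the moment program established above. 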
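
The "rudimentary" construction can be made explicit. The following is a minimal Python sketch, assuming the two-point formulas $x_2 = (\sigma - \mu x_1)/(\mu - x_1)$ and $p_1 = (\sigma - \mu^2)/(\sigma - 2\mu x_1 + x_1^2)$ that are derived in the proofs for Section 5; the function name `two_point_distribution` and the numerical inputs are hypothetical and serve only as an illustration.

```python
def two_point_distribution(mu, sigma, x1):
    """Return (x1, x2, p1, p2) for a two-point law with mean mu and second moment sigma.

    Requires sigma > mu**2 (a strict Jensen gap) and 0 <= x1 < mu.
    """
    if sigma <= mu ** 2:
        raise ValueError("need sigma > mu^2; sigma == mu^2 forces a point mass at mu")
    if not 0 <= x1 < mu:
        raise ValueError("the lower support point must satisfy 0 <= x1 < mu")
    x2 = (sigma - mu * x1) / (mu - x1)
    p1 = (sigma - mu ** 2) / (sigma - 2 * mu * x1 + x1 ** 2)
    return x1, x2, p1, 1.0 - p1


# Example: eta = nu = beta = 1 gives mu = eta/nu = 1 and sigma = 2*beta/nu = 2.
x1, x2, p1, p2 = two_point_distribution(mu=1.0, sigma=2.0, x1=0.0)
first_moment = p1 * x1 + p2 * x2
second_moment = p1 * x1 ** 2 + p2 * x2 ** 2
print(x1, x2, p1, p2, first_moment, second_moment)
```

Any admissible lower support point `x1` yields a feasible distribution, so the feasible set is non-empty whenever the Jensen condition holds strictly.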

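Proof of the proposition. Consider a sequence $f^{(k)}(x)$, $x \ge a$, given by 

\[
f^{(k)}(x) =
\begin{cases}
\eta - \nu (x - a) & \text{for } a \le x \le x_1^{(k)} + a \\
\eta - \nu x_1^{(k)} - \nu p_2^{(k)} (x - a - x_1^{(k)}) & \text{for } x_1^{(k)} + a \le x \le x_2^{(k)} + a \\
0 & \text{for } x_2^{(k)} + a \le x
\end{cases}
\]

where 

\[
x_1^{(k)} = \mu - \gamma^{(k)}, \quad x_2^{(k)} \to \infty, \quad p_1^{(k)} = 1 - p_2^{(k)}, \quad
\gamma^{(k)} = \frac{\sigma - \mu^2}{x_2^{(k)} - \mu}, \quad
p_2^{(k)} = \frac{\sigma - \mu^2}{(x_2^{(k)})^2 - 2\mu x_2^{(k)} + \sigma}
\tag{EC.9}
\]

and $\mu = \eta/\nu$, $\sigma = 2\beta/\nu$ as before. 

It is obvious that for large enough $k$, $f^{(k)}$ is non-negative and convex. Moreover, $f^{(k)}(a) = f^{(k)}(a+) = \eta$ and $f^{(k)\prime}_+(a) \ge -\nu$. To show $\int_a^\infty f^{(k)}(x)\,dx = \beta$, we first verify that 

\[ p_1^{(k)} x_1^{(k)} + p_2^{(k)} x_2^{(k)} = \mu \tag{EC.10} \]

and 

\[ p_1^{(k)} (x_1^{(k)})^2 + p_2^{(k)} (x_2^{(k)})^2 = \sigma \tag{EC.11} \]
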

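for all $k$. In fact, we will do so by showing that $\gamma^{(k)}$ and $p_2^{(k)}$ displayed in (EC.9) are the unique choices that satisfy (EC.10) and (EC.11) and also $x_1^{(k)} = \mu - \gamma^{(k)}$ and $p_1^{(k)} = 1 - p_2^{(k)}$. With the latter conditions, (EC.10) and (EC.11) can be written as 

\[ (1 - p_2^{(k)})(\mu - \gamma^{(k)}) + p_2^{(k)} x_2^{(k)} = \mu \]

and 

\[ (1 - p_2^{(k)})(\mu - \gamma^{(k)})^2 + p_2^{(k)} (x_2^{(k)})^2 = \sigma \]

respectively, which further gives 

\[ p_2^{(k)} \left( x_2^{(k)} - \mu + \gamma^{(k)} \right) = \gamma^{(k)} \tag{EC.12} \]

and 

\[ p_2^{(k)} \left( (x_2^{(k)})^2 - (\mu - \gamma^{(k)})^2 \right) + (\mu - \gamma^{(k)})^2 = \sigma \tag{EC.13} \]

From (EC.12) we have 

\[ p_2^{(k)} = \frac{\gamma^{(k)}}{x_2^{(k)} + \gamma^{(k)} - \mu} \tag{EC.14} \]

Putting (EC.14) into (EC.13), we get 

\[ \frac{\gamma^{(k)} \left( (x_2^{(k)})^2 - (\mu - \gamma^{(k)})^2 \right)}{x_2^{(k)} + \gamma^{(k)} - \mu} + (\mu - \gamma^{(k)})^2 = \sigma \]

which can be simplified to 

\[ \gamma^{(k)} \left( x_2^{(k)} + \mu - \gamma^{(k)} \right) + (\mu - \gamma^{(k)})^2 = \sigma \]

giving 

\[ \gamma^{(k)} = \frac{\sigma - \mu^2}{x_2^{(k)} - \mu} \tag{EC.15} \]

Plugging (EC.15) into (EC.14), we have 

\[ p_2^{(k)} = \frac{\sigma - \mu^2}{(\sigma - \mu^2) + (x_2^{(k)} - \mu)^2} = \frac{\sigma - \mu^2}{(x_2^{(k)})^2 - 2\mu x_2^{(k)} + \sigma} \]

thus recovering $\gamma^{(k)}$ and $p_2^{(k)}$ in (EC.9). 
Therefore, 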


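\[
\begin{aligned}
\int_a^\infty f^{(k)}(x)\,dx
&= \int_a^{x_1^{(k)} + a} [\eta - \nu (x - a)]\,dx + \int_{x_1^{(k)} + a}^{x_2^{(k)} + a} [\eta - \nu x_1^{(k)} - \nu p_2^{(k)} (x - a - x_1^{(k)})]\,dx \\
&= \eta x_2^{(k)} - \frac{\nu}{2} (x_1^{(k)})^2 - \nu x_1^{(k)} (x_2^{(k)} - x_1^{(k)}) - \frac{\nu p_2^{(k)}}{2} (x_2^{(k)} - x_1^{(k)})^2 \\
&= \eta x_2^{(k)} + \frac{\nu p_1^{(k)}}{2} (x_1^{(k)})^2 + \frac{\nu p_2^{(k)}}{2} (x_2^{(k)})^2 - \nu p_1^{(k)} x_1^{(k)} x_2^{(k)} - \nu p_2^{(k)} (x_2^{(k)})^2 \quad \text{using } p_1^{(k)} = 1 - p_2^{(k)} \\
&= \eta x_2^{(k)} + \frac{\nu \sigma}{2} - \nu x_2^{(k)} \mu \quad \text{using (EC.10) and (EC.11)} \\
&= \beta \quad \text{using } \eta - \nu \mu = 0 \text{ and } \beta = \nu \sigma / 2
\end{aligned}
\tag{EC.16}
\]

Hence $f^{(k)}$ is feasible for (1) for large enough $k$. 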
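
The feasibility computation in (EC.9) to (EC.16) can be checked numerically. The following Python sketch, under the assumption $a = 0$ and with the purely illustrative values $\eta = \nu = \beta = 1$ (so $\mu = 1$, $\sigma = 2$), builds the breakpoints of $f^{(k)}$ for several values of $x_2^{(k)}$ and integrates the piecewise-linear function exactly with two trapezoids. The function names are hypothetical.

```python
def sequence_member(eta, nu, beta, x2):
    """Return (x1, p2) from (EC.9) for a given upper breakpoint x2, with a = 0."""
    mu, sigma = eta / nu, 2.0 * beta / nu
    gamma = (sigma - mu ** 2) / (x2 - mu)
    p2 = (sigma - mu ** 2) / (x2 ** 2 - 2.0 * mu * x2 + sigma)
    return mu - gamma, p2


def area_under_f(eta, nu, beta, x2):
    """Exact integral of the piecewise-linear f^(k), plus its value at x2."""
    x1, p2 = sequence_member(eta, nu, beta, x2)
    height_at_x1 = eta - nu * x1
    height_at_x2 = height_at_x1 - nu * p2 * (x2 - x1)   # should vanish
    first_piece = 0.5 * (eta + height_at_x1) * x1
    second_piece = 0.5 * (height_at_x1 + height_at_x2) * (x2 - x1)
    return first_piece + second_piece, height_at_x2


results = [area_under_f(eta=1.0, nu=1.0, beta=1.0, x2=x2) for x2 in (10.0, 100.0, 1000.0)]
for area, end_value in results:
    print(area, end_value)
```

For each choice of `x2` the area stays at $\beta$ and the function reaches zero at the upper breakpoint, while the support of $f^{(k)}$ extends further into the tail.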
Now, the objective value evaluated at $f^{(k)}$ is 

\[
\int_a^{x_1^{(k)} + a} h(x) (\eta - \nu (x - a))\,dx + \int_{x_1^{(k)} + a}^{x_2^{(k)} + a} h(x) \left( \eta - \nu x_1^{(k)} - \nu p_2^{(k)} (x - a - x_1^{(k)}) \right) dx
\tag{EC.17}
\]

The first term in (EC.17) is bounded since $x_1^{(k)} \to \mu$. We focus on the second term. By the assumption, we can find $C > 0$ such that $h(x) \ge C x^{\epsilon}$ for all $x \ge a$. Then, for large enough $k$, 

\[
\begin{aligned}
&\int_{x_1^{(k)} + a}^{x_2^{(k)} + a} h(x) \left( \eta - \nu x_1^{(k)} - \nu p_2^{(k)} (x - a - x_1^{(k)}) \right) dx \\
&\quad \ge C \int_{x_1^{(k)} + a}^{x_2^{(k)} + a} x^{\epsilon} \left( \eta - \nu x_1^{(k)} - \nu p_2^{(k)} (x - a - x_1^{(k)}) \right) dx \\
&\quad = C \int_{x_1^{(k)} + a}^{x_2^{(k)} + a} \left[ (\eta - \nu p_1^{(k)} x_1^{(k)} + \nu p_2^{(k)} a)\,x^{\epsilon} - \nu p_2^{(k)} x^{\epsilon + 1} \right] dx \\
&\quad = C \Bigg[ (\eta - \nu p_1^{(k)} x_1^{(k)} + \nu p_2^{(k)} a) \frac{(x_2^{(k)} + a)^{\epsilon + 1}}{\epsilon + 1}
- (\eta - \nu p_1^{(k)} x_1^{(k)} + \nu p_2^{(k)} a) \frac{(x_1^{(k)} + a)^{\epsilon + 1}}{\epsilon + 1} \\
&\qquad\quad - \nu p_2^{(k)} \frac{(x_2^{(k)} + a)^{\epsilon + 2}}{\epsilon + 2}
+ \nu p_2^{(k)} \frac{(x_1^{(k)} + a)^{\epsilon + 2}}{\epsilon + 2} \Bigg]
\end{aligned}
\tag{EC.18}
\]

Note that since $p_1^{(k)} \to 1$, $x_1^{(k)} \to \mu$, and $\eta - \nu\mu = 0$, the second term in (EC.18) converges to 0. Moreover, since $p_2^{(k)} \to 0$, the fourth term also converges to 0. Consider the first term in (EC.18). 
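In particular, 

\[
\eta - \nu p_1^{(k)} x_1^{(k)} + \nu p_2^{(k)} a = \eta - \nu (1 - p_2^{(k)})(\mu - \gamma^{(k)}) + \nu p_2^{(k)} a = \nu \gamma^{(k)} p_1^{(k)} + \nu p_2^{(k)} (\mu + a)
\]

by using $\eta - \nu\mu = 0$ and $p_1^{(k)} = 1 - p_2^{(k)}$. Substituting $\gamma^{(k)} = (\sigma - \mu^2)/(x_2^{(k)} - \mu)$ and $p_2^{(k)} = O(1/(x_2^{(k)})^2)$, and using $p_1^{(k)} \to 1$, we have 

\[
(\eta - \nu p_1^{(k)} x_1^{(k)} + \nu p_2^{(k)} a) \frac{(x_2^{(k)} + a)^{\epsilon + 1}}{\epsilon + 1}
= \left( \nu \gamma^{(k)} p_1^{(k)} + \nu p_2^{(k)} (\mu + a) \right) \frac{(x_2^{(k)} + a)^{\epsilon + 1}}{\epsilon + 1}
= \frac{\nu (\sigma - \mu^2) (x_2^{(k)})^{\epsilon}}{\epsilon + 1} (1 + o(1))
\]

On the other hand, for the third term in (EC.18), substituting $p_2^{(k)} = (\sigma - \mu^2)/((x_2^{(k)})^2 - 2\mu x_2^{(k)} + \sigma)$, we have 

\[
-\nu p_2^{(k)} \frac{(x_2^{(k)} + a)^{\epsilon + 2}}{\epsilon + 2} = -\frac{\nu (\sigma - \mu^2) (x_2^{(k)})^{\epsilon}}{\epsilon + 2} (1 + o(1))
\]

Thus, (EC.18) is equal to 

\[
C \left( \frac{1}{\epsilon + 1} - \frac{1}{\epsilon + 2} \right) \nu (\sigma - \mu^2) (x_2^{(k)})^{\epsilon} (1 + o(1)) \to \infty
\]

and hence the optimal value of (1) is $\infty$. □ 

EC.2. Proofs for Section 5 

To prove the proposition, we borrow the following result: 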


Lemma EC.2 (Adapted from Theorem 5.1 in Birge and Dula (1991)). Consider $OPT(\mathcal{P}[0, c])$ for any $0 < c < \infty$. Suppose $H$ is convex with derivative $H'$ convex on $(0, \tilde{c})$ and concave on $(\tilde{c}, c)$ for some $0 \le \tilde{c} \le c$. If $OPT(\mathcal{P}[0, c])$ is consistent, then an optimal solution exists and lies in $\mathcal{P}_2[0, c]$. 

This lemma follows from Theorem 5.1 in Birge and Dula (1991) that applies to the associated dual problem. 

Proof of the proposition. By the theorem proved in Section EC.1 using Winkler (1988), $OPT(\mathcal{P}^+)$ has the same optimal value as $OPT(\mathcal{P}_3^+)$. By Lemma EC.2, for every $P \in \mathcal{P}_3^+$, which necessarily has bounded support, say on $[0, M]$ for some $M > 0$, there exists $P' \in \mathcal{P}_2[0, M]$ such that $Z(P') \ge Z(P)$. Hence $OPT(\mathcal{P}^+)$ has the same optimal value as $OPT(\mathcal{P}_2^+)$, which concludes the proposition. □ 

Proof of the proposition. Proof of 1. Let the optimal probability measure in $\mathcal{P}_2^+$ be represented by $(x_1, x_2, p_1, p_2)$. Note that $x_1 \ne x_2$ since otherwise $\sigma = \mu^2$. Adopting a similar line of analysis as in Birge and Dula (1991), we let $x_1 < x_2$ without loss of generality. For a two-support-point distribution to be feasible, we must have $x_1 < \mu$. Feasibility also enforces that $p_1 x_1 + p_2 x_2 = \mu$, $p_1 x_1^2 + p_2 x_2^2 = \sigma$ and $p_1 + p_2 = 1$. Hence $p_2 = 1 - p_1$, which gives $p_1 x_1 + (1 - p_1) x_2 = \mu$ and $p_1 x_1^2 + (1 - p_1) x_2^2 = \sigma$. From the first equation we get $p_1 = (x_2 - \mu)/(x_2 - x_1)$. Putting this into $p_1 x_1^2 + (1 - p_1) x_2^2 = \sigma$, we further get $x_2 = (\sigma - \mu x_1)/(\mu - x_1)$. Now, putting this in turn into $p_1 = (x_2 - \mu)/(x_2 - x_1)$, we obtain $p_1 = (\sigma - \mu^2)/(\sigma - 2\mu x_1 + x_1^2)$ and hence $p_2 = 1 - p_1 = (\mu - x_1)^2/(\sigma - 2\mu x_1 + x_1^2)$. Therefore, $Z^*$ is given by 

\[
\max_{x_1 \in [0, \mu)} \nu \left( p_1 H(x_1) + p_2 H(x_2) \right)
= \max_{x_1 \in [0, \mu)} \nu \left( \frac{\sigma - \mu^2}{\sigma - 2\mu x_1 + x_1^2} H(x_1) + \frac{(\mu - x_1)^2}{\sigma - 2\mu x_1 + x_1^2} H\!\left( \frac{\sigma - \mu x_1}{\mu - x_1} \right) \right)
\]

which is exactly $\max_{x_1 \in [0, \mu)} W(x_1)$. 

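Proof of 2. Let $P^{(k)} \sim (x_1^{(k)}, x_2^{(k)}, p_1^{(k)}, p_2^{(k)})$ be a feasible sequence with $Z(P^{(k)}) \to Z^*$. Without loss of generality let $x_1^{(k)} < x_2^{(k)}$. Since $p_1^{(k)} x_1^{(k)} + p_2^{(k)} x_2^{(k)} = \mu$, we must have $x_1^{(k)} \le \mu$. Then we must have a subsequence $x_2^{(k_i)} \to \infty$, since otherwise $(x_1^{(k)}, x_2^{(k)}, p_1^{(k)}, p_2^{(k)})$ would lie in a compact set and there would exist a subsequence $(x_1^{(k_i)}, x_2^{(k_i)}, p_1^{(k_i)}, p_2^{(k_i)}) \to (x_1^*, x_2^*, p_1^*, p_2^*)$, where $Z(P^{(k_i)}) = \nu \sum_{j=1}^2 p_j^{(k_i)} H(x_j^{(k_i)}) \to \nu \sum_{j=1}^2 p_j^* H(x_j^*)$ by the continuity of $H$, violating the non-existence of an optimal solution. By feasibility, we have 

\[
p_2^{(k_i)} = \frac{\sigma - p_1^{(k_i)} (x_1^{(k_i)})^2}{(x_2^{(k_i)})^2} \to 0
\quad \text{and} \quad
p_2^{(k_i)} x_2^{(k_i)} = \frac{\sigma - p_1^{(k_i)} (x_1^{(k_i)})^2}{x_2^{(k_i)}} \to 0
\]

Thus $p_1^{(k_i)} = 1 - p_2^{(k_i)} \to 1$ and $x_1^{(k_i)} = (\mu - p_2^{(k_i)} x_2^{(k_i)})/p_1^{(k_i)} \to \mu$. Therefore, 

\[
Z(P^{(k_i)}) = \nu \left( p_1^{(k_i)} H(x_1^{(k_i)}) + p_2^{(k_i)} H(x_2^{(k_i)}) \right)
= \nu \left( p_1^{(k_i)} H(x_1^{(k_i)}) + \frac{\sigma - p_1^{(k_i)} (x_1^{(k_i)})^2}{(x_2^{(k_i)})^2} H(x_2^{(k_i)}) \right)
\to \nu \left( H(\mu) + \lambda (\sigma - \mu^2) \right)
\]

Proof of 3. First, we show that $W(x_1) \to \nu (H(\mu) + \lambda (\sigma - \mu^2))$ as $x_1 \nearrow \mu$. Consider the second term of $W(x_1)$ given by 

\[
\lim_{x_1 \nearrow \mu} \nu \frac{(\mu - x_1)^2}{\sigma - 2\mu x_1 + x_1^2} H\!\left( \frac{\sigma - \mu x_1}{\mu - x_1} \right)
= \lim_{x_1 \nearrow \mu} \nu \frac{(\sigma - \mu x_1)^2}{\sigma - 2\mu x_1 + x_1^2} \cdot \frac{H\!\left( \frac{\sigma - \mu x_1}{\mu - x_1} \right)}{\left( \frac{\sigma - \mu x_1}{\mu - x_1} \right)^2}
= \nu \lambda (\sigma - \mu^2)
\]

and the claim follows. Combining Parts 1 and 2 of this proposition, we must have $Z^* = \max_{x_1 \in [0, \mu]} W(x_1)$. □ 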
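
The one-dimensional search over $W(x_1)$ and its limit as $x_1 \nearrow \mu$ can be illustrated with a short Python sketch. The inputs are hypothetical: $h(x) = \mathbf{1}\{x > 1\}$, hence $H(x) = \max(x - 1, 0)^2/2$, together with $\mu = 1$, $\sigma = 2$, $\nu = 1$. The sketch reads $\lambda$ as the limit of $H(x)/x^2$, which is how it enters the limit in the proof of Part 2; for this $H$ that gives $\lambda = 1/2$.

```python
def W(x1, mu, sigma, nu, H):
    """Objective value of the feasible two-point distribution with lower point x1 < mu."""
    denom = sigma - 2.0 * mu * x1 + x1 ** 2
    p1 = (sigma - mu ** 2) / denom
    p2 = (mu - x1) ** 2 / denom
    x2 = (sigma - mu * x1) / (mu - x1)
    return nu * (p1 * H(x1) + p2 * H(x2))


# Hypothetical H, the double antiderivative of h(x) = 1{x > 1}.
H = lambda x: 0.5 * max(x - 1.0, 0.0) ** 2
mu, sigma, nu, lam = 1.0, 2.0, 1.0, 0.5

limit_value = nu * (H(mu) + lam * (sigma - mu ** 2))   # value approached as x1 -> mu
grid = [i / 1000 for i in range(1000)]                 # x1 in [0, mu)
values = [W(x1, mu, sigma, nu, H) for x1 in grid]
near_mu = W(mu - 1e-6, mu, sigma, nu, H)

print(values[0], max(values), near_mu, limit_value)
```

In this example $W$ increases on $[0, \mu)$ and approaches the limit value without attaining it, which corresponds to the case where no optimal two-point distribution exists and the bound is obtained from the limit in Part 3.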

EC.3. Proofs for Section E] 


We first show a result in parallel to Theorem for the case of & 

Theorem EC.7. Suppose $h$ is bounded. Then the optimal value of the formulation (EC.20) below is the same as 

\[
\begin{array}{ll}
\max & \nu E[H(X)] \\
\text{subject to} & \underline{\mu} \le E[X] \le \bar{\mu} \\
& \underline{\sigma} \le E[X^2] \le \bar{\sigma} \\
& P \in \mathcal{P}^+
\end{array}
\tag{EC.19}
\]

Here the decision variable is a probability distribution $P \in \mathcal{P}^+$, and $E[\cdot]$ is the corresponding expectation. Moreover, there is a one-to-one correspondence between the feasible solutions to (EC.20) and (EC.19), given by $f'_+(x + a) = \nu (p(x) - 1)$ for $x \in \mathbb{R}^+$, where $f'_+$ is the right derivative of a feasible solution $f$ of (EC.20) such that $f(x) = \int_a^x f'_+(t)\,dt + \eta$ for $x \ge a$, and $p$ is a probability distribution function that is associated with a feasible probability measure over $\mathbb{R}^+$ in (EC.19). 


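Proof of Theorem EC.7. Note that the formulation can be written as 

\[
\begin{array}{ll}
\max_{\underline{\beta} \le \beta \le \bar{\beta},\ \underline{\eta} \le \eta \le \bar{\eta}}\ \max_f & \int_a^\infty h(x) f(x)\,dx \\
\text{subject to} & \int_a^\infty f(x)\,dx = \beta \\
& f(a) = f(a+) = \eta \\
& f'_+(a) \ge -\nu \\
& f(x) \text{ convex for } x \ge a \\
& f(x) \ge 0 \text{ for } x \ge a
\end{array}
\tag{EC.20}
\]

The inner maximization is exactly (1), and thus by the reformulation theorem proved in Section EC.1 we can reformulate (EC.20) as 

\[
\begin{array}{ll}
\max_{\underline{\beta} \le \beta \le \bar{\beta},\ \underline{\eta} \le \eta \le \bar{\eta}}\ \max_{P \in \mathcal{P}^+} & \nu E[H(X)] \\
\text{subject to} & E[X] = \frac{\eta}{\nu} \\
& E[X^2] = \frac{2\beta}{\nu}
\end{array}
\]

which is equivalent to (EC.19). □ 

For convenience, we denote $OPT(\mathcal{D})$ as the program 

\[
\begin{array}{ll}
\max & \nu E[H(X)] \\
\text{subject to} & \underline{\mu} \le E[X] \le \bar{\mu} \\
& \underline{\sigma} \le E[X^2] \le \bar{\sigma} \\
& P \in \mathcal{D}
\end{array}
\]

where $\mathcal{D}$ is a collection of probability measures on $\mathbb{R}$. For example, (EC.19) can be written as $OPT(\mathcal{P}^+)$. Let $Z(P) = \nu E[H(X)]$ be the objective function in $P$. 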

Proposition EC.2. The optimal value of $OPT(\mathcal{P}^+)$ is identical to that of $OPT(\mathcal{P}_3^+)$. 

Proof of Proposition EC.2. For $P$ feasible in $OPT(\mathcal{P}^+)$, let $\mu = E[X]$ and $\sigma = E[X^2]$ be its first and second moments. By the theorem proved in Section EC.1 using Winkler (1988), there must exist $P' \in \mathcal{P}_3^+$ with the corresponding expectations $E'[X] = \mu$ and $E'[X^2] = \sigma$ such that $Z(P) \le Z(P')$. □ 

Proof of Theorem\^ Theorem [^follows from Theorem EC.7 and Proposition EC.2 in the same 
way as the proof of Theorem □ 

Proposition EC.3. Under OPT{V^) has the same optimal value as OPT{Vf). 


Proof of Proposition EC.3. We know from Proposition EC.2 that $OPT(\mathcal{P}^+)$ has the same optimal value as $OPT(\mathcal{P}_3^+)$. Any $P \in \mathcal{P}_3^+$ must necessarily have bounded support, say on $[0, M]$. By Lemma EC.2 there must exist $P' \in \mathcal{P}_2^+$ with the same first and second moments as $P$, such that $Z(P) \le Z(P')$. □ 

The following explains the origin of the two subproblems in (13): 
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Lemma EC.3. Under Assumption and let a > The optimal value of OPTiJ^^) given by 
Z* = max{Z*,^ 2 }, where Z* is the optimal value of 


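\[
\begin{array}{ll}
\max & \nu E[H(X)] \\
\text{subject to} & E[X] = \bar{\mu} \\
& \underline{\sigma} \le E[X^2] \le \bar{\sigma} \\
& P \in \mathcal{P}_2^+
\end{array}
\tag{EC.21}
\]

and $Z_2^*$ is the optimal value of 

\[
\begin{array}{ll}
\max & \nu E[H(X)] \\
\text{subject to} & \underline{\mu} \le E[X] \le \bar{\mu} \\
& E[X^2] = \bar{\sigma} \\
& P \in \mathcal{P}_2^+
\end{array}
\tag{EC.22}
\]

respectively. 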


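Proof of Lemma EC.3. We argue that to solve $OPT(\mathcal{P}_2^+)$, it suffices to restrict attention to the feasible region $\{P \in \mathcal{P}_2^+ : E[X] = \bar{\mu},\ \underline{\sigma} \le E[X^2] \le \bar{\sigma}\} \cup \{P \in \mathcal{P}_2^+ : \underline{\mu} \le E[X] \le \bar{\mu},\ E[X^2] = \bar{\sigma}\}$. Since $h \ge 0$, $Z^* \ge 0$. There is nothing to prove if $Z^* = 0$. So suppose $Z^* > 0$. There exists $P \sim (x_1, x_2, p_1, p_2) \in \mathcal{P}_2^+$ with one of the $x_i$'s having $H(x_i) > 0$ and $p_i > 0$. Now suppose $P$ satisfies $E[X] < \bar{\mu}$ and $E[X^2] < \bar{\sigma}$. We can increase $x_i$ so that $E[X] \le \bar{\mu}$ and $E[X^2] \le \bar{\sigma}$ remain satisfied, and $Z(P)$ is at least as large as before since $H(x)$ is non-decreasing. Hence any $P$ such that $E[X] < \bar{\mu}$ and $E[X^2] < \bar{\sigma}$ must have $Z(P) \le Z(P')$ for some $P' \in \{P \in \mathcal{P}_2^+ : E[X] = \bar{\mu},\ \underline{\sigma} \le E[X^2] \le \bar{\sigma}\} \cup \{P \in \mathcal{P}_2^+ : \underline{\mu} \le E[X] \le \bar{\mu},\ E[X^2] = \bar{\sigma}\}$. This proves the lemma. □ 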


Proof of Theorem^ Lemma EC.3 allows one to consider only the programs (EC.21) and 
(EC.22) when solving OPT{Vf). Theorem @ then follows from Lemma Theorem 
Proposition EC.3 using the same line of arguments in the proof of Theorem]^ 


EC.7 


and 

□ 









